Chapter 1
Introduction 


Sharing images is an essential experience. Be it a drawing carved in rock, a
painting exhibited in a museum, or a photo capturing a special moment, it is
the sharing that relives the experience stored in the image. Several techno- 
logical developments have spurred the sharing of images in unprecedented 
volumes. The first is the ease with which images can be captured in a digital 
format by cameras, cellphones and other wearable sensory devices. The second
is the Internet that allows transfer of digital image content to anyone,
anywhere in the world. Finally, and most recently, the sharing of digital 
imagery has reached new heights by the massive adoption of social network 
platforms. All of a sudden images came with tags, and tagging, comment- 
ing, and rating of any digital image has become a common habit. The 
sharing paradigm is led by users' interactions with each other, like forming
groups of shared interests, sharing messages that convey sentiments, and
commenting on the photos that have been shared. Consequently, in the
huge quantity of available media, some of these images are going to become 
very popular, while others are going to be totally unnoticed and end up in 
oblivion. 


1.1 The goal 


Our ultimate goal is to extract information from image collections in social 
networks. In particular, we aim at obtaining tags, i.e. human-interpretable
labels associated with the content at a global level. These can be related
to objective aspects such as the presence of things, properties and activi- 
ties, or subjective ones such as the sentiments aroused in a viewer or the 
attractiveness of an image. 

Being able to extract this information can have a great impact in several 
applications. First, the retrieval of images from collections can be improved. 
Current image search engines (such as Google or Yahoo), which traditionally
rely on the associated text data, have recently exploited the visual content
to improve performance. Similarly, search in social networks mostly relies
on user-provided metadata in the form of tags or textual descriptions. Second,
it can ease the browsing of large collections, for instance through selection
or summarization of the most attractive and significant photos. In
particular, sentiments aroused in the viewer can play a role in producing
meaningful output. Third, the distribution and enjoyment of content can
be improved. Advertising and distribution of content can be more efficient
when matching content to user preferences. Moreover, to minimize
storage costs, images may be replicated according to popularity while
still maintaining a low latency for unpopular content. For these reasons,
image retrieval and understanding receive a lot of attention from both the 
scientific community and industry. 

Machine understanding of media is still very poor. While the data processing
capabilities of machines are continuously improving (e.g. Moore's law (Moore,
1965)), extracting information from unannotated multimedia is a challenging
task. The main hindrance is that machines are able to compute only low
level features of the data, hardly correlated to the semantics. Tasks such 
as recognizing things, understanding the sentiment induced in the viewer 
or predicting the expected attractiveness of an image, require high level 
features. This is a well-known problem in the literature, formalized as the 
semantic gap (Smeulders et al., 2000): "The semantic gap is the lack of
coincidence between the information that machines can extract from the
visual data and the interpretations the user may give to the data." Hence
the ensuing question is: 


How can we fill the semantic gap for multimedia understanding? 


We believe that Social Networks are promising frameworks 
that can fill the gap. Compared to classic multimedia databases,
social networks provide a broader context where the user is king. Users can
contribute by providing photos with attached metadata (such as tags,
description, location) or by expressing interest in others' content (e.g. likes,
comments). In Figures 1.1 and 1.2 we show two examples of such contribu- 
tions in two different social networks. 

Social network contributions are provided by common users. They of- 
ten cannot meet high quality standards related to content association, in 
particular for accurately describing objective aspects of the visual content 
according to some expert's opinion (Dodge et al., 2012). Moreover, when 
subjective components are considered (e.g. sentiments), different users may 
read images differently. 

The most historically exploited pieces of metadata are the social tags
associated with the images. These tags tend to follow context, trends and
events in the real world. They are often used to describe both the situation
and the entity represented in the visual content. So tagging deviations
due to spatial and temporal correlation to external factors, including user
influence, semantics of activity and relationships between tags, are common
phenomena. Social tags tend to be imprecise, ambiguous, incomplete and
biased towards personal perspectives (Golder and Huberman, 2006; Sen
et al., 2006; Sigurbjörnsson and van Zwol, 2008; Kennedy et al., 2006).

Figure 1.1: Example of user-generated content on the social network Instagram. An image
of a bracelet is associated with a short description and several tags. Several users have
commented on the content.

Figure 1.2: Example of user-generated content on the social network Flickr. Tags are
associated with an image of a panoramic view of a mountain.

Quite a few researchers have proposed solutions for image annotation 
and retrieval in social frameworks (Li et al., 2015), although the peculiarities 
of this domain have been only partially addressed. 


1.2 Contributions and Organization 


In this thesis we show that the tagged images shared on social media platforms
are a promising resource for bridging the semantic gap. In particular, we focus
on image annotation and provide a structured survey of methods in social
networks, with a thorough empirical evaluation of several key methods. Then
we describe four novel state-of-the-art methods for extracting information
that explicitly take into account the social context.

Two themes can be highlighted. The first one is related to the task 
of objective analysis of images (i.e. recognize things), while the second 
one relates to the tasks of subjective analysis (i.e. recognize the sentiment 
induced in viewers, predict the expected popularity of images). Across
both themes, the underlying idea of our work is the exploitation of
social images through the design of features that comprise both the visual
observation and the associated tags. Learned or handcrafted, these features provide
a robust global representation of the content and context. 

The thesis is organized as follows.¹ Considering the absence of a comprehensive
review of annotation and retrieval in social networks, we start
in Chapter 2 with a structured survey of related work. Although image annotation
and retrieval in social networks is a relatively recent direction of
research, several tasks have been addressed by the multimedia community.
We survey three linked semantic tasks (i.e. tag assignment, tag refinement 
and tag-based image retrieval) that have seen the most contributions to 
date. Figure 1.3 shows an example of tag refinement of an image and its 
associated user tags. Recognizing a lack of a structured survey in the lit- 
erature, we aimed at giving a reference contribution for future researchers 
in this field. We organize the rich literature of tagging and retrieval in a 
taxonomy to highlight the ingredients of the main works and recognize their 
advantages and limitations. In particular, we structure our survey along the 
line of understanding how a specific method constructs the underlying tag 
relevance function. 

Witnessing the absence of a thorough empirical comparison in the literature
for the three semantic tasks, in Chapter 3 we establish a common
experimental protocol and subsequently apply it in the evaluation of key
methods. Our proposed protocol contains training data of varied scales
extracted from social frameworks. This makes it possible to evaluate the methods
under analysis with data that reflect the specificity of the social domain.
We made the data and source code public so that new proposals for tag
assignment, tag refinement, and tag retrieval can be evaluated rigorously
and easily. Taken together with Chapter 2, these efforts should provide an
overview of the field's past and foster progress for the near future.

Figure 1.3: An example of an image processed with a tag refinement algorithm.
Irrelevant tags are removed and additional relevant tags are added.

¹ Note that each chapter is written in a self-contained fashion and can be read on its own.


Chapter 4 builds on ideas from the previous chapters to propose a novel
approach for tag assignment. By considering visual content and the tags
associated with an image, novel features are automatically learned. A cross-modal
method is proposed to capture the intricate dependencies between
image content and annotations. We propose a learning procedure based
on Kernel Canonical Correlation Analysis which finds a mapping between
visual and textual words by projecting them into a latent meaning space.
The learned mapping is then used to annotate new images using advanced
nearest neighbor voting methods. We evaluate our approach on three pop- 
ular datasets, and show clear improvements over several approaches relying 
on more standard representations. 


Chapter 5 gives an evaluation of the temporal information in web im- 
ages. The idea is to use the temporal gist of annotations to improve tasks 
such as annotation, indexing and retrieval. While visual content, text and
metadata are typically used to improve these tasks, here we look at the
temporal aspect of social media production and tagging. The correlation
of the time series of the tags with Google searches shows that, for certain
concepts, web information sources may be beneficial to the annotation of
social media.

Chapters 6 and 7 deal with the non-semantic problems of image sentiment
analysis and popularity prediction. In particular, Chapter 6 investigates
the use of a multimodal feature learning approach using neural network
based models such as Skip-gram and Denoising Autoencoders. The task is
to perform sentiment analysis of micro-blogging content, such as Twitter
short messages, that are composed of a short text and, possibly, an image.
A novel architecture that incorporates these models is proposed and tested 
on several standard Twitter datasets. We show that the approach is efficient 
and obtains good classification results. 

Considering that the attractiveness of images is related to popularity, in
Chapter 7 we propose to use visual sentiment features together with three
novel context features to predict a concise popularity score of social images.
Experiments on large scale datasets show the benefits of the proposed features
on the performance of image popularity prediction. Moreover, exploiting
state-of-the-art sentiment features, we report a qualitative analysis of which 
sentiments seem to be related to good or poor popularity. 

Finally, Chapter 8 summarizes the contributions of the thesis and discusses
avenues for future research. The full list of papers published
from this thesis is provided in Appendix A.


Chapter 2 
Literature review of Assignment, Refinement and Retrieval 


This chapter gives a unified survey of related work on the three
closely linked problems of Tag Assignment, Tag Refinement and 
Tag-based Image Retrieval. While existing works vary in terms of 
their targeted tasks and methodology, they rely on the key func- 
tionality of tag relevance, i.e., estimating the relevance of a spe- 
cific tag with respect to the visual content of a given image and its 
social context. A taxonomy is introduced to structure the growing 
literature, understand the ingredients of the main works, clarify 
their connections and differences, and recognize their merits and
limitations.¹


Excellent surveys on content-based image retrieval have been published 
in the past. In their seminal work, Smeulders et al. review the early years up 
to the year 2000 by focusing on what can be seen in an image and introducing 
the main scientific problem of the field: the semantic gap as "the lack of 
coincidence between the information that one can extract from the visual 
data and the interpretation that the same data have for a user in a given 
situation" (Smeulders et al., 2000). Datta et al. continue along this line and 
describe the coming-of-age of the field, highlighting the key theoretical and 
empirical contributions of recent years (Datta et al., 2008). These reviews 
completely ignore social platforms and socially generated images, which is 
not surprising as the phenomenon only became apparent after these reviews 
were published. 

In this chapter, we survey the state-of-the-art of content-based image 
retrieval in the context of social image platforms, with a comprehensive
treatment of the closely linked problems of image tag assignment, image tag
refinement and tag-based image retrieval. Similar to (Smeulders et al., 2000) 
and (Datta et al., 2008), the focus of this survey is on visual information, 
but we explicitly take into account and quantify the value of social tagging. 


¹ Parts of this chapter previously appeared in Li, X., Uricchio, T., Ballan, L., Bertini, M.,
Snoek, C. G. and Del Bimbo, A. (2016). "Socializing the semantic gap: A comparative survey on
image tag assignment, refinement, and retrieval". ACM Computing Surveys (CSUR), 49(1), 14.
The publication is available at http://dx.doi.org/10.1145/2906152


2.1 Problems and Tasks


Social tags are provided by common users. They often cannot meet high 
quality standards related to content association, in particular for accurately 
describing objective aspects of the visual content according to some expert's 
opinion (Dodge et al., 2012). Social tags tend to follow context, trends and 
events in the real world. They are often used to describe both the situation 
and the entity represented in the visual content. So tagging deviations 
due to spatial and temporal correlation to external factors, including user 
influence, semantics of activity and relationships between tags, are common 
phenomena. Social tags tend to be imprecise, ambiguous, incomplete and 
biased towards personal perspectives (Golder and Huberman, 2006; Sen
et al., 2006; Sigurbjörnsson and van Zwol, 2008; Kennedy et al., 2006).
Quite a few researchers have proposed solutions for image annotation and 
retrieval in social frameworks, although the peculiarities of this domain 
have been only partially addressed. We categorize existing works into three 
different main tasks and structure our survey along these tasks: 


• Tag Assignment. Given an unlabeled image, tag assignment strives
to assign a (fixed) number of tags related to the image content (Makadia
et al., 2010; Guillaumin et al., 2009; Verbeek et al., 2010; Tang
et al., 2011).


• Tag Refinement. Given an image associated with some initial tags,
tag refinement aims to remove irrelevant tags from the initial tag list
and enrich it with novel, yet relevant, tags (Liu et al., 2010; Wu et al.,
2013; Znaidia et al., 2013; Lin et al., 2013; Feng et al., 2014).


• Tag Retrieval. Given a tag and a collection of images labeled with
the tag (and possibly other tags), the goal of tag retrieval is to retrieve
images relevant with respect to the tag of interest (Li et al., 2009b;
Duan et al., 2011; Sun et al., 2011; Gao et al., 2013; Wu et al., 2013).


Other related tasks such as tag filtering (Zhu et al., 2010; Liu, Yan, Hua 
and Zhang, 2011; Zhu et al., 2012) and tag suggestion (Sigurbjörnsson and
van Zwol, 2008; Li et al., 2009b; Wu et al., 2009) have also been studied.
As these tasks focus on either cleaning existing tags or expanding them, we 
view them as variants of tag refinement. 


2.2 Scope and Aims 


Existing works in tag assignment, refinement, and retrieval vary in terms 
of their targeted tasks and methodology, making it non-trivial to interpret
them within a unified framework. Nonetheless, we observe that all works
rely on the key functionality of tag relevance, i.e., estimating the relevance 
of a specific tag with respect to the visual content of a given image and its 
social context. In general terms, relevance should be evaluated considering
the complementarity of tags: a tag may be of low interest alone but become
interesting in conjunction with others. However, in the literature only a
few methods consider multi-tag relevance evaluation, and only for the task
of multi-tag retrieval (Li et al., 2012; Nie et al., 2012; Borth et al., 2013). 
Hence, we focus on methods that implement the unique-tag relevance model. 


We survey papers that learn from images tagged in social contexts. We 
do not cover traditional image classification that is grounded on carefully 
labeled data. For a state-of-the-art overview in that direction, we refer the 
interested reader to (Everingham et al., 2015; Russakovsky et al., 2015). 
Nonetheless, one may question the necessity of using socially tagged exam- 
ples as training data, given that a number of labeled resources are already 
publicly accessible. An exemplar of such resources is ImageNet (Deng et al., 
2009), providing crowd-sourced positive examples for over 20k classes. Since 
ImageNet employs several web image search engines to obtain candidate images,
its positive examples tend to be biased by the search results. As observed
by (Vreeswijk et al., 2012), the positive set of vehicles mainly consists
of cars and buses, although vehicles can be trucks, watercraft and aircraft.
Moreover, controversial images are discarded upon vote disagreement during
the crowdsourcing. All this reduces diversity in visual appearance. We
empirically show in Chapter 3 the advantage of socially tagged examples 
against ImageNet for tag relevance learning. 


Reviews on social tagging exist. The work by Gupta et al. discusses 
papers on why people tag, what influences the choice of tags, and how 
to model the tagging process, but its discussion on content-based image 
tagging is limited (Gupta et al., 2010). The focus of (Jabeen et al., 2015) 
is on papers about adding semantics to tags by exploiting varied knowledge 
sources such as Wikipedia, DBpedia, and WordNet. Again, it leaves the 
visual information untouched. 


Several reviews that consider socially tagged images have appeared re- 
cently. In (Liu, Hua and Zhang, 2011), technical achievements in content- 
based tag processing for social images are briefly surveyed. Sawant et 
al. (Sawant et al., 2011), Wang et al. (Wang, Ni, Hua and Chua, 2012) 
and Mei et al. (Mei et al., 2014) present extended reviews of particular 
aspects, i.e., collaborative media annotation, assistive tagging, and visual 
search re-ranking, respectively. In (Sawant et al., 2011), papers that propose 
collaborative image labeling games and tagging in social media networks are 
reviewed. In (Wang, Ni, Hua and Chua, 2012) the authors survey papers 
where computers assist humans in tagging either by organizing data for
manual labelling, improving the quality of human-provided tags or recommending
tags for manual selection, instead of applying purely automatic tagging.
In (Mei et al., 2014) the authors review techniques that aim to improve
initial search results, typically returned by a text-based visual search engine,
by visual search re-ranking. These reviews offer summaries of the methods and
interesting insights on particular aspects of the domain, without giving an
experimental comparison between the varied methods.

We notice efforts in empirical evaluations of social media annotation and 
retrieval (Sun et al., 2011; Uricchio et al., 2013; Ballan, Bertini, Uricchio 
and Del Bimbo, 2014). In (Sun et al., 2011), the authors analyze different 
dimensions to compute the relevance score between a tagged image and a 
tag. They evaluate varied combinations of these dimensions for tag-based 
image retrieval on NUS-WIDE, a leading benchmark set for social image 
retrieval (Chua et al., 2009). However, their evaluation focuses only on tag- 
based image ranking features, without comparing content-based methods. 
Moreover, tag assignment and refinement are not covered. In (Uricchio 
et al., 2013; Ballan, Bertini, Uricchio and Del Bimbo, 2014), the authors
compare three algorithms for tag refinement on NUS-WIDE and MIRFlickr,
a popular benchmark set for tag assignment and refinement (Huiskes
et al., 2010). However, the two reviews lack a thorough comparison between
different methods under the umbrella of a common experimental protocol. 
Moreover, they fail to assess the high-level connection between image tag 
assignment, refinement, and retrieval. 


2.3 Foundations 


Our key observation is that the essential component, which measures the 
relevance between a given image and a specific tag, stands at the heart of 
the three tasks. In order to describe this component in a more formal way, 
we first introduce some notation. 

We use x, t, and u to represent the three basic elements in social images,
namely image, tag, and user. An image x is shared on social media by
its user u. A user u can choose a specific tag t to label x. By sharing
and tagging images, a set of users U contribute a set of n socially tagged
images X, wherein $X_t$ denotes the set of images tagged with t. Tags used
to describe the image set form a vocabulary of m tags V. The relationship
between images and tags can be represented by an image-tag association
matrix $D \in \{0, 1\}^{n \times m}$, where $D_{ij} = 1$ means the i-th image is labeled with
the j-th tag, and $D_{ij} = 0$ otherwise.
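
To make the notation concrete, the following minimal Python sketch builds the association matrix D for a toy collection and reads the set of images tagged with a given tag off one of its columns. The image identifiers, the tags and the helper name `build_association_matrix` are illustrative assumptions and are not taken from any surveyed work.

```python
def build_association_matrix(tagged_images):
    """Build the binary image-tag association matrix D (n images x m tags).

    tagged_images: dict mapping an image id to the set of tags assigned to it.
    Returns (image_ids, vocabulary, D) with D[i][j] == 1 iff image i has tag j.
    """
    image_ids = sorted(tagged_images)
    vocabulary = sorted({t for tags in tagged_images.values() for t in tags})
    column = {tag: j for j, tag in enumerate(vocabulary)}
    D = [[0] * len(vocabulary) for _ in image_ids]
    for i, image_id in enumerate(image_ids):
        for tag in tagged_images[image_id]:
            D[i][column[tag]] = 1
    return image_ids, vocabulary, D


# Toy collection (hypothetical data).
toy = {
    "img1": {"beach", "sunset"},
    "img2": {"beach", "dog"},
    "img3": {"dog"},
}
image_ids, vocabulary, D = build_association_matrix(toy)

# The set of images tagged with t = "beach" is a column lookup in D.
j = vocabulary.index("beach")
X_beach = [image_ids[i] for i, row in enumerate(D) if row[j] == 1]
```

In a real collection D is extremely sparse, so a sparse representation would normally replace the nested lists used here.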

Given an image and a tag, we introduce a real-valued function that
computes the relevance between x and t based on the visual content and an
optional set of user information $\Theta$ associated with the image:

$$f_\Phi(x, t; \Theta)$$

We use $\Theta$ in a broad sense, making it refer to any type of social context
provided by or referring to the user like associated tags, where and when the
image was taken, personal profile, and contacts. The subscript $\Phi$ specifies
how the tag relevance function is constructed. We can easily interpret each
of the three tasks: assignment and refinement can be done by sorting V in
descending order by $f_\Phi(x, t; \Theta)$, while retrieval can be achieved by sorting
the labeled image set $X_t$ in descending order in terms of $f_\Phi(x, t; \Theta)$. Note
that this formalization does not necessarily imply that the same implemen- 
tation of tag relevance is applied for all the three tasks. For example, for 
retrieval relevance is intended to obtain image ranking (Li, 2015) while tag 
ranking for each single image is the goal of assignment (Wu et al., 2009) 
and refinement (Qian et al., 2014). 
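
The reduction of the three tasks to sorting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_relevance` is a hypothetical stand-in for the relevance function, the `concept_scores` field and the 0.25 bonus are invented for the example, and the user information is modeled as a plain dictionary.

```python
def rank_tags(f, x, vocabulary, theta=None):
    """Assignment and refinement: sort the vocabulary by f(x, t, theta), descending."""
    return sorted(vocabulary, key=lambda t: f(x, t, theta), reverse=True)


def rank_images(f, t, labeled_images):
    """Retrieval: sort the images labeled with t by f(x, t, theta), descending.

    labeled_images: list of (x, theta) pairs, i.e. each image with its context.
    """
    return sorted(labeled_images, key=lambda pair: f(pair[0], t, pair[1]), reverse=True)


def toy_relevance(x, t, theta):
    """Hypothetical relevance: visual evidence plus a bonus for user-provided tags."""
    visual = x["concept_scores"].get(t, 0.0)
    bonus = 0.25 if theta and t in theta.get("tags", ()) else 0.0
    return visual + bonus


vocabulary = ["beach", "dog", "sunset"]
x = {"concept_scores": {"beach": 0.9, "sunset": 0.4, "dog": 0.1}}
x2 = {"concept_scores": {"beach": 0.2}}

# Tag assignment: no context is available for an unlabeled image.
assigned = rank_tags(toy_relevance, x, vocabulary)[:2]

# Tag refinement: the initial user tags act as the context.
refined = rank_tags(toy_relevance, x, vocabulary, {"tags": {"dog"}})

# Tag retrieval: rank the images labeled with "beach".
retrieved = rank_images(
    toy_relevance, "beach", [(x2, {"tags": {"beach"}}), (x, {"tags": {"beach"}})]
)
```

The same function is reused for all three tasks here only for brevity; as noted above, the formalization does not require a single implementation.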

Fig. 2.1 presents a unified framework, illustrating the main data flow 
of varied approaches to tag relevance learning. Compared to traditional 
methods that rely on expert-labeled examples, a novel characteristic of a 
social media based method is its capability to learn from socially tagged 
examples with unreliable annotations. Such a training media is marked as
S in the framework. Optionally, in order to obtain a refined training media
$\hat{S}$, one might consider designing a filter to remove unwanted tags and images.
In addition, prior information such as tag statistics, tag correlations, and 
image affinities in the training media are independent of a specific image- 
tag pair. They can be precomputed for the sake of efficiency. As the filter 
and the precomputation appear to be a choice of implementation, they are 
positioned as auxiliary components in Fig. 2.1. 

A number of implementations of the relevance function are described 
and compared in Chapter 3, with regard to their use for tag assignment, 
refinement and retrieval. Depending on how $f_\Phi(x, t; \Theta)$ is composed
internally, we propose a taxonomy which organizes existing works along two
dimensions, namely media and learning. As shown in Table 2.1, the media
dimension characterizes what essential information $f_\Phi(x, t; \Theta)$ exploits,
while the learning dimension depicts how such information is exploited. We 
explore the taxonomy along the media dimension in Section 2.4 and the 
learning dimension in Section 2.5, followed by a discussion on the two aux- 
iliary components in Section 2.6. 


2.4 Media for tag relevance 


Different sources of information may play a role in determining the relevance
between an image and a social tag. For instance, the position of a tag
appearing in the tag list might reflect a user's tagging priority to some extent
(Sun et al., 2011). Knowing what other tags are assigned to the image (Zhu
et al., 2012) or how other users label similar images (Li et al., 2009b;
Kennedy et al., 2009) can also be helpful for judging whether the tag under
examination is appropriate or not. Depending on what modalities in S are
utilized, we divide existing works into the following three groups: 1) tag
based, 2) tag + image based and 3) tag + image + user information based,
ordered in light of the amount of information they utilize. Table 2.1 shows
this classification for several papers that appeared in the literature on the
subject.

Figure 2.1: Dataflow to structure the literature on tag relevance learning for
image tag assignment, refinement and retrieval. We follow the input data as it
flows through the process of the tag relevance function $f_\Phi(x, t; \Theta)$ to higher level tasks,
complete with common internal activities and surrounding auxiliary components. Dashed
lines indicate optional processes such as the auxiliary components and transduction-based
algorithms.


2.4.1 Tag based 


These methods build $f_\Phi(x, t; \Theta)$ purely based on tag information. Tag position
is considered in (Sun et al., 2011), where a tag appearing near the top of the tag
list is regarded as more relevant. To find tags that are semantically close
to the majority of the tags assigned to the test image, tag co-occurrence is
considered in (Sigurbjörnsson and van Zwol, 2008; Zhu et al., 2012), while
topic modeling is employed in (Xu et al., 2009). As the tag based methods
presume that the test image has been labeled with some initial tags, i.e. the
initial tags are taken as the user information $\Theta$, they are inapplicable for
tag assignment.
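
As an illustration of how tag co-occurrence can drive a purely tag based relevance score, the sketch below averages the Jaccard co-occurrence between a candidate tag and the initial tags of an image. It is a minimal example under our own assumptions (Jaccard as the co-occurrence measure, plain averaging, toy data) and does not reproduce the exact formulation of any cited method.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_stats(tag_lists):
    """Precompute tag frequencies and pairwise co-occurrence counts."""
    freq = Counter()
    pair = Counter()
    for tags in tag_lists:
        unique = sorted(set(tags))
        freq.update(unique)
        pair.update(combinations(unique, 2))  # pairs are stored in sorted order
    return freq, pair


def jaccard(a, b, freq, pair):
    """Jaccard co-occurrence: images with both tags / images with either tag."""
    if a == b:
        return 1.0
    both = pair[tuple(sorted((a, b)))]
    either = freq[a] + freq[b] - both
    return both / either if either else 0.0


def tag_based_relevance(candidate, initial_tags, freq, pair):
    """Average co-occurrence of the candidate with the image's other initial tags."""
    others = [t for t in initial_tags if t != candidate]
    if not others:
        return 0.0
    return sum(jaccard(candidate, t, freq, pair) for t in others) / len(others)


# Toy tag lists from a hypothetical training collection.
collection = [
    ["beach", "sea", "sunset"],
    ["beach", "sea"],
    ["dog", "park"],
    ["beach", "dog"],
]
freq, pair = cooccurrence_stats(collection)

initial_tags = ["beach", "sunset"]
sea_score = tag_based_relevance("sea", initial_tags, freq, pair)
park_score = tag_based_relevance("park", initial_tags, freq, pair)
```

The statistics depend only on the training collection, not on a specific image-tag pair, so they can be computed once and reused.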


2.4.2 Tag + Image based


Works in this group develop $f_\Phi(x, t; \Theta)$ on the basis of visual information
and associated tags. The main rationale behind them is visual consistency,
i.e. visually similar images shall be labeled with similar tags. Implementations
of this intuition can be grouped into three strategies. One, leverage
images visually close to the test image (Li et al., 2009b; Li, Snoek and Worring,
2010; Verbeek et al., 2010; Ma et al., 2010; Wu et al., 2011; Feng et al.,
2012). Two, exploit relationships between images labeled with the same tag 
(Liu, Hua, Yang, Wang and Zhang, 2009; Richter et al., 2012; Liu, Yan, 
Hua and Zhang, 2011; Kuo et al., 2012; Gao et al., 2013). Three, learn 
visual classifiers from socially tagged examples (Wang et al., 2009a; Chen 
et al., 2012; Li and Snoek, 2013; Yang, Gao, Zhang, Shao and Chua, 2014). 
By propagating tags based on the visual evidence, the above works exploit 
the image modality and the tag modality in a sequential way. By contrast, 
there are works that concurrently exploit the two modalities. This can be 
approached by generating a common latent space upon the image-tag asso- 
ciation (Srivastava and Salakhutdinov, 2014; Niu et al., 2014; Duan et al., 
2014), so that a cross media similarity can be computed between images and 
tags (Zhuang and Hoi, 2011; Qi et al., 2012; Liu et al., 2013). In (Pereira 
et al., 2014), the latent space is constructed by Canonical Correlation Anal- 
ysis, finding two matrices which separately project feature vectors of image 
and tag into the same subspace. In (Ma et al., 2010), a random walk model 
is used on a unified graph composed from the fusion of an image similar- 
ity graph with an image-tag connection graph. In (Wu et al., 2013; Xu 
et al., 2014; Zhu et al., 2010), predefined image similarity and tag similarity 
are used as two constraint terms to enforce that similarities induced from 
the recovered image-tag association matrix will be consistent with the two 
predefined similarities. 
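
The first strategy, leveraging images visually close to the test image, can be illustrated with a simple neighbor voting sketch. The code below is a simplified example under our own assumptions: two-dimensional toy features, Euclidean distance, and a prior correction that subtracts the vote count expected from the tag frequency alone. It is meant to convey the idea of visual consistency, not to reproduce a specific published algorithm.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def neighbor_vote_relevance(query_feature, tag, training_set, k):
    """Votes for the tag among the k visually nearest training images.

    training_set: list of (feature_vector, set_of_tags) pairs.
    The expected number of votes under the tag's prior frequency is subtracted,
    so that frequent tags do not dominate simply because they are frequent.
    """
    neighbors = sorted(
        training_set, key=lambda item: euclidean(query_feature, item[0])
    )[:k]
    votes = sum(1 for _, tags in neighbors if tag in tags)
    prior = sum(1 for _, tags in training_set if tag in tags) / len(training_set)
    return votes - k * prior


# Toy training set (hypothetical features and tags).
training = [
    ([0.0, 0.0], {"beach", "sea"}),
    ([0.1, 0.0], {"beach"}),
    ([0.0, 0.2], {"beach", "sunset"}),
    ([5.0, 5.0], {"dog"}),
    ([5.1, 4.9], {"dog", "park"}),
    ([4.8, 5.2], {"dog", "beach"}),
]
query = [0.05, 0.05]
beach_score = neighbor_vote_relevance(query, "beach", training, k=3)
dog_score = neighbor_vote_relevance(query, "dog", training, k=3)
```

Sorting the full training set is acceptable only at toy scale; a real system would rely on an approximate nearest neighbor index.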

Although late fusion has been actively studied for multimedia data anal- 
ysis (Atrey et al., 2010), improving tag relevance estimation by late fusion is 
not much explored. There are some efforts in that direction, among which 
interesting performance has been reported in (Qian et al., 2014) and more 
recently in (Li, 2015). 
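
A minimal sketch of late fusion for tag relevance is given below. The text does not prescribe a fusion rule, so min-max normalization followed by a weighted average is purely an assumption chosen for simplicity, and the two score sets are invented toy values.

```python
def min_max_normalize(scores):
    """Rescale a dict of tag -> score to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {t: 0.0 for t in scores}
    return {t: (s - lo) / (hi - lo) for t, s in scores.items()}


def late_fusion(score_sets, weights=None):
    """Weighted average of normalized scores from several relevance functions.

    score_sets: list of dicts, all defined over the same set of tags.
    """
    if weights is None:
        weights = [1.0 / len(score_sets)] * len(score_sets)
    normalized = [min_max_normalize(s) for s in score_sets]
    tags = normalized[0].keys()
    return {t: sum(w * n[t] for w, n in zip(weights, normalized)) for t in tags}


# Scores for one image from two hypothetical relevance functions
# that live on different scales.
tag_based = {"beach": 0.6, "dog": 0.1, "sunset": 0.35}
visual = {"beach": 2.0, "dog": -1.0, "sunset": 0.5}

fused = late_fusion([tag_based, visual])
ranking = sorted(fused, key=fused.get, reverse=True)
```

Normalizing before averaging matters because the individual relevance functions generally produce scores that are not directly comparable.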


2.4.3 Tag + Image + User information based 


In addition to tags and images, this group of works exploits user information,
motivated from varied perspectives. With the hypothesis that a specific tag
chosen by many users to label visually similar images is more likely to be
relevant with respect to the visual content, (Li et al., 2009b) utilizes user
identities to ensure that learning examples come from distinct users. A similar
idea is reported in (Kennedy et al., 2009), finding visually similar image
pairs with matching tags from different users. (Ginsca et al., 2014) improves
image retrieval by favoring images uploaded by users with good credibility 
estimates. In (Sawant et al., 2010; Li, Gavves, Snoek, Worring and Smeul- 
ders, 2011), personal tagging preference is considered in the form of tag 
statistics computed from images a user has uploaded in the past. These 
past images are used in (Liu et al., 2014) to learn a user-specific embedding 
space. In (Sang, Xu and Liu, 2012), user affinities, measured in terms of
the number of common groups users are sharing, are considered in a tensor
analysis framework. Similarly, tensor based low-rank data reconstruction 
is employed in (Qian et al., 2015) to discover latent associations between 
users, images, and tags. Photo timestamps are exploited for time-sensitive 
image retrieval (Kim and Xing, 2013), where the connection between image 
occurrence and various temporal factors is modeled. In (McParlane et al., 
2013a), time-constrained tag co-occurrence statistics are considered to re- 
fine the output of visual classifiers for tag assignment. In their follow-up 
work (McParlane et al., 2013b), location-constrained tag co-occurrence computed
from images taken in a specific continent is further included. User
interactions in social networks are exploited in (Sawant et al., 2010), com- 
puting local interaction networks from the comments left by other users. 
Social-network metadata such as group memberships of images and contacts
of users is employed in (Wang et al., 2009b; McAuley and Leskovec,
2012; Johnson et al., 2015) for image classification.


Comparing the three groups, tag + image appears to be the mainstream, 
as evidenced by the imbalanced distribution in Table 2.1. Intuitively, using 
more media from S would typically increase the reliability of tag relevance 
estimation. We attribute the imbalance among the groups, in particular the 
relatively few works in the third group, to the following two reasons. First, 
no publicly available dataset with expert annotations was built to gather 
representative and adequate user information, e.g. MIRFlickr has nearly 10k 
users for 25k images, while in NUS-WIDE only 6% of the users have at least
15 images. As a consequence, current works that leverage user information 
are forced to use a minimal subset to alleviate sample insufficiency (Sang, 
Xu and Liu, 2012; Sang, Xu and Lu, 2012) or homemade collections with 
social tags as ground truth instead of benchmark sets (Sawant et al., 2010; 
Li, Gavves, Snoek, Worring and Smeulders, 2011). Second, adding more 
media often results in a substantial increase in terms of both computation 
and memory, e.g. the cubic complexity for tensor factorization in (Sang, Xu 
and Liu, 2012). As a trade-off, one has to use S of a much smaller scale. 
The dilemma is whether one should use large data with less media or more
media but less data. 

It is worth noting that the above groups are not exclusive. The output 
of some methods can be used as a refined input of some other methods. 
In particular, we observe a frequent usage of tag-based methods by others
for their computational efficiency. For instance, tag relevance measured in 
terms of tag similarity is used in (Zhuang and Hoi, 2011; Gao et al., 2013; 
Li and Snoek, 2013) before applying more advanced analysis, while nearest 
neighbor tag propagation is a pre-process used in (Zhu et al., 2010). The 
number of tags per image is embedded into image retrieval functions in (Liu, 
Hua, Yang, Wang and Zhang, 2009; Xu et al., 2009; Zhuang and Hoi, 2011; 
Chen et al., 2012). 

Given the varied sources of information one could leverage, the subse- 
quent question is how the information is exactly utilized, which will be made 
clear next. 


2.5 Learning for tag relevance 


This section presents the second dimension of the taxonomy, elaborating on
various algorithms for tag relevance learning. Depending on whether the
tag relevance learning process is transductive, i.e., producing tag relevance
scores without distinguishing between training and testing, we divide
existing works into transduction-based and induction-based. Since the latter
produces rules or models that are directly applicable to a novel instance
(Michalski, 1983), it scales better to large data than its transductive
counterpart. Depending on whether an explicit model, be it discriminative
or generative, is built, a further division of the induction-based methods
can be made: instance-based algorithms and model-based algorithms.
Consequently, we divide existing works into the following three exclusive
groups: 1) instance-based, 2) model-based, and 3) transduction-based.


2.5.1 Instance-based 


This class of methods does not perform explicit generalization but, instead,
compares new test images with training instances. It is called instance-based
because it constructs hypotheses directly from the training instances
themselves. These methods are non-parametric and the complexity of the
learned hypotheses grows as the amount of training data increases. The
neighbor voting algorithm (Li et al., 2009b) and its variants (Kennedy et al.,
2009; Li, Snoek and Worring, 2010; Truong et al., 2012; Lee et al., 2013; Zhu
et al., 2014) estimate the relevance of a tag t with respect to an image x by
counting the occurrence of t in annotations of the visual neighbors of x. The
visual neighborhood is created using features obtained from early-fusion of
global features (Li et al., 2009b), distance metric learning to combine local
and global features (Verbeek et al., 2010; Wu et al., 2011), cross modal
learning of tags and image features (Qi et al., 2012; Ballan, Uricchio,
Seidenari and Bimbo, 2014; Pereira et al., 2014), and fusion of multiple
single-feature learners (Li, Snoek and Worring, 2010). While the standard
neighbor voting algorithm (Li et al., 2009b) simply lets the neighbors vote
equally, efforts have been made to (heuristically) weight neighbors in terms
of their importance. For instance, in (Truong et al., 2012; Lee et al., 2013)
the visual similarity is used as the weights. As an alternative to such a
heuristic strategy, (Zhu et al., 2014) models the relationships among the
neighbors by constructing a directed voting graph, wherein there is a
directed edge from image x_i to image x_j if x_i is in the k nearest
neighbors of x_j. Subsequently an adaptive random walk is conducted over the
voting graph to estimate the tag relevance. However, the performance gain
obtained by these weighting strategies appears to be limited (Zhu et al.,
2014). The kernel density estimation technique used in (Liu, Hua, Yang, Wang
and Zhang, 2009) can be viewed as another form of weighted voting, but the
votes come from images labeled with t instead of the visual neighbors. (Yang
et al., 2011) further considers the distance of the test image to images not
labeled with t. In order to eliminate semantically unrelated samples in the
neighborhood, sparse reconstruction from a k-nearest neighborhood is used in
(Tang et al., 2009, 2011). In (Lin et al., 2013), with the intention of
recovering missing tags by matrix reconstruction, the image and tag
modalities are exploited separately and in parallel, each producing a new
candidate image-tag association matrix. Then, the two resultant tag relevance
scores are linearly combined to produce the final tag relevance scores. To
address the incompleteness of tags associated with the visual neighbors,
(Znaidia et al., 2013) proposes to enrich these tags by exploiting tag
co-occurrence prior to neighbor voting.


Table 2.1: The taxonomy of methods for tag relevance learning, organized along
the Media and Learning dimensions of Fig. 2.1. Methods for which we provide an
experimental evaluation in the next chapter are indicated in bold font.

Media: tag

  Learning: Instance-based
    Sigurbjörnsson et al. (Sigurbjörnsson and van Zwol, 2008)
    Sun et al. (Sun et al., 2011)
    Zhu et al. (Zhu et al., 2012)

  Learning: Model-based
    Xu et al. (Xu et al., 2009)

  Learning: Transduction-based
    -

Media: tag + image

  Learning: Instance-based
    Liu et al. (Liu, Hua, Yang, Wang and Zhang, 2009)
    Makadia et al. (Makadia et al., 2010)
    Tang et al. (Tang et al., 2011)
    Wu et al. (Wu et al., 2011)
    Yang et al. (Yang et al., 2011)
    Truong et al. (Truong et al., 2012)
    Qi et al. (Qi et al., 2012)
    Lin et al. (Lin et al., 2013)
    Lee et al. (Lee et al., 2013)
    Uricchio et al. (Uricchio et al., 2013)
    Zhu et al. (Zhu et al., 2014)
    Ballan et al. (Ballan, Uricchio, Seidenari and Bimbo, 2014)
    Pereira et al. (Pereira et al., 2014)

  Learning: Model-based
    Wu et al. (Wu et al., 2009)
    Guillaumin et al. (Guillaumin et al., 2009)
    Verbeek et al. (Verbeek et al., 2010)
    Liu et al. (Liu et al., 2010)
    Ma et al. (Ma et al., 2010)
    Liu et al. (Liu, Yan, Hua and Zhang, 2011)
    Duan et al. (Duan et al., 2011)
    Feng et al. (Feng et al., 2012)
    Srivastava et al. (Srivastava and Salakhutdinov, 2014)
    Chen et al. (Chen et al., 2012)
    Lan et al. (Lan and Mori, 2013)
    Li et al. (Li and Snoek, 2013)
    Li et al. (Li, Liu and Lu, 2013)
    Wang et al. (Wang, Zhou, Xu, Mei, Hua and Li, 2014)
    Niu et al. (Niu et al., 2014)

  Learning: Transduction-based
    Zhu et al. (Zhu et al., 2010)
    Wang et al. (Wang et al., 2010)
    Li et al. (Li, Liu, Zhu, Liu and Lu, 2010)
    Zhuang et al. (Zhuang and Hoi, 2011)
    Richter et al. (Richter et al., 2012)
    Kuo et al. (Kuo et al., 2012)
    Liu et al. (Liu et al., 2013)
    Gao et al. (Gao et al., 2013)
    Wu et al. (Wu et al., 2013)
    Yang et al. (Yang, Gao, Zhang, Shao and Chua, 2014)
    Feng et al. (Feng et al., 2014)
    Xu et al. (Xu et al., 2014)

Media: tag + image + user

  Learning: Instance-based
    Li et al. (Li et al., 2009b)
    Kennedy et al. (Kennedy et al., 2009)
    Li et al. (Li, Snoek and Worring, 2010)
    Znaidia et al. (Znaidia et al., 2013)
    Liu et al. (Liu et al., 2014)

  Learning: Model-based
    Sawant et al. (Sawant et al., 2010)
    Li et al. (Li, Gavves, Snoek, Worring and Smeulders, 2011)
    McAuley et al. (McAuley and Leskovec, 2012)
    Kim et al. (Kim and Xing, 2013)
    McParlane et al. (McParlane et al., 2013b)
    Ginsca et al. (Ginsca et al., 2014)
    Johnson et al. (Johnson et al., 2015)

  Learning: Transduction-based
    Sang et al. (Sang, Xu and Liu, 2012)
    Sang et al. (Sang, Xu and Lu, 2012)
    Qian et al. (Qian et al., 2015)
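
The basic neighbor voting idea can be illustrated with a minimal sketch. It
is not the implementation of (Li et al., 2009b): the function name, the
Euclidean distance, and the exact form of the prior term (the vote count
expected if the k neighbors were picked at random) are assumptions made for
illustration only.

```python
import numpy as np


def neighbor_vote(test_feat, train_feats, train_tags, vocab, k=3):
    """Score each tag in vocab by counting it among the k nearest training images.

    train_tags is a list of tag sets, one per training image. Subtracting the
    expected count under random neighbor selection is an assumed way of
    penalizing overly frequent tags.
    """
    # Euclidean distance from the test image to every training image.
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    neighbors = np.argsort(dists, kind="stable")[:k]
    n = len(train_tags)
    scores = {}
    for t in vocab:
        votes = sum(1 for i in neighbors if t in train_tags[i])
        freq = sum(1 for tags in train_tags if t in tags)
        scores[t] = votes - k * freq / n
    return scores
```

Weighted variants would replace the unit vote by a weight derived from the
visual similarity of each neighbor.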


2.5.2 Model-based 


This class of tag relevance learning algorithms is founded on
parameterized models learned from the training media. Notice that the
models can be tag-specific or holistic for all tags. As an example of holistic 
modeling, a topic model approach is presented in (Wang, Zhou, Xu, Mei, 
Hua and Li, 2014) for tag refinement, where a hidden topic layer is intro- 
duced between images and tags. Consequently, the tag relevance function is 
implemented as the dot product between the topic vector of the image and 
the topic vector of the tag. In particular, the authors extend the Latent 
Dirichlet Allocation model (Blei et al., 2003) to force images with similar 
visual content to have similar topic distribution. According to their experi- 
ments (Wang, Zhou, Xu, Mei, Hua and Li, 2014), however, the gain of such 
a regularization appears to be marginal compared to the standard Latent 
Dirichlet Allocation model. (Li, Liu and Lu, 2013) first finds embedding 
vectors of training images and tags using the image-tag association matrix 
of S. The embedding vector of a test image is obtained by a convex
combination of the embedding vectors of its neighbors retrieved in the original
visual feature space. Consequently, the relevance score is computed in terms 
of the Euclidean distance between the embedding vectors of the test image 
and the tag. 
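
The embedding-based scoring just described can be sketched as follows. This
is an illustration under assumptions, not the method of (Li, Liu and Lu,
2013): the embeddings are taken as given, and the inverse-distance weights
are one hypothetical way of forming a convex combination.

```python
import numpy as np


def embed_test_image(test_feat, train_feats, train_embeds, k=2):
    """Embed a test image as a convex combination of its neighbors' embeddings."""
    # Neighbors are retrieved in the original visual feature space.
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nn = np.argsort(dists, kind="stable")[:k]
    # Inverse-distance weights, normalized to sum to one (an assumed choice).
    w = 1.0 / (dists[nn] + 1e-8)
    w /= w.sum()
    return w @ train_embeds[nn]


def tag_relevance(image_embed, tag_embed):
    """Return a score that is higher when the two embeddings are closer."""
    return -np.linalg.norm(image_embed - tag_embed)
```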

For tag-specific modeling, linear SVM classifiers trained on features aug- 
mented by pre-trained classifiers of popular tags are used in (Chen et al., 
2012) for tag retrieval. Fast intersection kernel SVMs trained on selected 
relevant positive and negative examples are used in (Li and Snoek, 2013). A 
bag-based image reranking framework is introduced in (Duan et al., 2011), 
where pseudo relevant images retrieved by tag matching are partitioned into 
clusters by using visual and textual features. Then, by treating each cluster 
as a bag and images within the cluster as its instances, multiple instance 
learning (Andrews et al., 2003) is employed to learn multiple-instance SVMs 
per tag. Viewing the social tags of a test image as ground truth, a multi- 
modal tag suggestion method based on both tags and visual correlation is 
introduced in (Wu et al., 2009). Each modality is used to generate a rank- 
ing feature, and the tag relevance function is a combination of these ranking 
features, with the combination weights learned online by the RankBoost al- 
gorithm (Freund et al., 2003). In (Guillaumin et al., 2009; Verbeek et al., 
2010), logistic regression models are built per tag to promote rare tags. In a 
similar spirit to (Li and Snoek, 2013), (Zhou et al., 2015) learns an ensemble 
of SVMs by treating tagged images as positive training examples and un- 
tagged images as candidate negative training examples. Using the ensemble 
to classify image regions generated by automated image segmentation, the 
authors assign tags at the image level and the region level simultaneously. 


2.5.3 Transduction-based 


This class of methods consists of procedures that evaluate tag relevance
for the image-tag pairs of a given set of images by minimizing some specific
cost function. Given an initial image-tag association matrix $D$, the output
of the procedure is a new matrix $\hat{D}$, the elements of which are taken
as tag relevance scores. Due to this formulation, no explicit form of the tag
relevance function exists, nor is there any distinction between training and
test sets (Joachims, 1999). If novel images are added to the initial set,
minimization of the cost function needs to be re-computed.

The majority of transduction-based approaches are founded on matrix
factorization (Zhu et al., 2010; Sang, Xu and Liu, 2012; Liu et al., 2013;
Wu et al., 2013; Kalayeh et al., 2014; Feng et al., 2014; Xu et al., 2014).
In (Zhuang and Hoi, 2011) the objective function is a linear combination of
the difference between $\hat{D}\hat{D}^{\top}$ and the matrix of image
similarity, the distortion between $\hat{D}^{\top}\hat{D}$ and the matrix of
tag similarity, and the difference between $\hat{D}$ and $D$. A stochastic
coordinate descent optimization is applied to a randomly chosen row of
$\hat{D}$ per iteration. In (Zhu et al., 2010), considering the fact that $D$
is corrupted with noise derived from missing or over-personalized tags,
robust principal component analysis with Laplacian regularization is
applied to recover $\hat{D}$ as a low-rank matrix. In (Wu et al., 2013),
$\hat{D}$ is regularized such that the image similarity induced from
$\hat{D}$ is consistent with the image similarity computed in terms of
low-level visual features, and the tag similarity induced from $\hat{D}$ is
consistent with the tag correlation score computed in terms of tag
co-occurrence. In (Xu et al., 2014), it is proposed to re-weight the penalty
term of each image-tag pair by their relevance score, which is estimated by a
linear fusion of tag-based and content-based relevance scores. To incorporate
the user element, (Sang, Xu and Liu, 2012) extends $D$ to a three-way tensor
with tag, image, and user as each of the ways. A core tensor and three
matrices representing the three media, obtained by Tucker decomposition
(Tucker, 1966), are multiplied to construct $\hat{D}$.
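
To make the last step concrete, the sketch below multiplies a core tensor
with three factor matrices to rebuild a three-way tensor. It only shows the
reconstruction; how the factors are estimated is not covered. The function
name, argument names, and the user x image x tag ordering of the ways are
assumptions for illustration.

```python
import numpy as np


def tucker_reconstruct(core, user_f, image_f, tag_f):
    """Rebuild a user x image x tag tensor from a Tucker-style factorization.

    core: (ru, ri, rt); user_f: (n_users, ru); image_f: (n_images, ri);
    tag_f: (n_tags, rt).
    """
    # Mode products of the core with each factor matrix, written as one einsum.
    return np.einsum("abc,ua,ib,tc->uit", core, user_f, image_f, tag_f)
```

With small ranks ru, ri and rt, the factors are far more compact than the
full tensor, which is the appeal of this kind of decomposition.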


As an alternative approach, in (Feng et al., 2014) it is assumed that 
the tags of an image are drawn independently from a fixed but unknown 
multinomial distribution. Estimation of this distribution is implemented by 
maximum likelihood with low-rank matrix recovery and Laplacian
regularization, as in (Zhu et al., 2010).


Graph-based label propagation is another type of transduction-based
method. In (Richter et al., 2012; Wang et al., 2010; Kuo et al., 2012),
the image-tag pairs are represented as a graph in which each node cor- 
responds to a specific image and the edges are weighted according to a 
multi-modal similarity measure. Viewing the top ranked examples in the 
initial search results as positive instances, tag refinement is implemented as 
a semi-supervised labeling process by propagating labels from the positive 
instances to the remaining examples using random walk. While the edge 
weights are fixed in the above works, (Gao et al., 2013) argues that fixing 
the weights could be problematic, because tags found to be discriminative in 
the learning process should adaptively contribute more to the edge weights. 
In that regard, the hypergraph learning algorithm (Zhou et al., 2006) is ex- 
ploited and weights are optimized by minimizing a joint loss function which 
considers both the graph structure and the divergence between the initial 
labels and the learned labels. In (Liu, Wu, Zhang, Shao and Zhuang, 2011), 
the hypergraph is embedded into a lower-dimensional space by the hypergraph
Laplacian.
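
Label propagation by random walk over a fixed-weight image graph can be
sketched as below. This is a generic random walk with restart rather than the
procedure of any cited work; the damping factor alpha, the iteration count,
and the function name are assumptions.

```python
import numpy as np


def propagate_labels(W, y0, alpha=0.85, n_iter=50):
    """Propagate initial relevance scores over an image similarity graph.

    W: non-negative similarity matrix (n x n) with no all-zero row.
    y0: initial scores, 1 for images assumed positive and 0 elsewhere.
    """
    # Row-normalize the similarities into transition probabilities.
    P = W / W.sum(axis=1, keepdims=True)
    y0 = np.asarray(y0, dtype=float)
    y = y0.copy()
    for _ in range(n_iter):
        # Each image averages its neighbors' scores, then mixes in its initial label.
        y = alpha * (P @ y) + (1 - alpha) * y0
    return y
```

Images that are strongly connected to the assumed positives end up with
higher scores than images that are only reachable through weak edges.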


Comparing the three groups of methods for learning tag relevance, an 
advantage of instance-based methods against the other two groups is their 
flexibility to adapt to previously unseen images and tags. They may simply 
add new training images into S or remove outdated ones. The advantage 
however comes with a price: S has to be maintained, a non-trivial task
given the increasing amount of training data available. Also, the
computational complexity and memory footprint grow linearly with respect to
the size of S. In contrast, model-based methods can be faster, especially
when linear classifiers are used, as the training data is compactly
represented by a fixed number of models. However, as the imagery of a given
tag may evolve, re-training is required to keep the models up-to-date.

Different from instance-based and model-based learning where individ- 
ual tags are considered independently, transduction-based learning methods 
via matrix factorization can favorably exploit inter-tag and inter-image re- 
lationships. However, their ability to deal with the extremely large number 
of social images is a concern. For instance, the use of Laplacian graphs
results in a memory complexity of O(|S|^2). The accelerated proximal gradient
algorithm used in (Zhu et al., 2010) requires Singular Value Decomposition, 
which is known to be an expensive operation. The Tucker decomposition 
used in (Sang, Xu and Liu, 2012) has a cubic computational complexity with 
respect to the number of training samples. We notice that some engineering 
tricks have been considered in these works, which alleviate the scalability 
issue to some extent. In (Zhuang and Hoi, 2011), for instance, clustering is
conducted in advance to divide S into much smaller subsets, and the
algorithm is applied to these subsets separately. By making the Laplacian
sparser by retaining only the k nearest neighbors (Zhu et al., 2010; Sang, Xu
and Liu, 2012), the memory footprint can be reduced to O(k·|S|), at the
cost of performance degradation. Perhaps due to the scalability concern,
works resorting to matrix factorization tend to experiment with a dataset 
of relatively small scale. 
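
The memory saving from keeping only the k nearest neighbors can be made
concrete with a small sketch. The storage layout (two arrays of shape n x k)
and the exponential similarity are assumptions for illustration and are not
taken from the cited works.

```python
import numpy as np


def knn_graph(feats, k):
    """Return (indices, weights), each of shape (n, k), for a k-NN similarity graph."""
    n = feats.shape[0]
    idx = np.empty((n, k), dtype=np.int64)
    wts = np.empty((n, k), dtype=np.float64)
    for i in range(n):
        # Distances are computed one row at a time, so no n x n matrix is stored.
        d = np.linalg.norm(feats - feats[i], axis=1)
        d[i] = np.inf  # exclude the image itself
        nn = np.argsort(d, kind="stable")[:k]
        idx[i] = nn
        wts[i] = np.exp(-d[nn])
    return idx, wts
```

Only n·k neighbor indices and weights are kept, compared with n·n entries for
a dense similarity matrix.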


2.6 Auxiliary components 


The Filter and Precompute components are auxiliary components that
may support and improve tag relevance learning.

Filter. As social tags are known to be subjective and overly personal- 
ized, removing personalized tags appears to be a natural and simple way 
to improve the tagging quality. This is usually the first step performed in 
the framework for tag relevance learning. Although there is a lack of golden 
criteria to determine which tags are personalized, a popular strategy is to 
exclude tags which cannot be found in the WordNet ontology (Zhu et al., 
2010; Li, Gavves, Snoek, Worring and Smeulders, 2011; Chen et al., 2012; 
Zhu et al., 2012) or a Wikipedia thesaurus (Liu, Hua, Yang, Wang and 
Zhang, 2009). Tags with rare occurrence, say appearing less than 50 times, 
are discarded in (Verbeek et al., 2010; Zhu et al., 2010). For methods that 
directly work on the image-tag association matrix (Zhu et al., 2010; Sang,
Xu and Liu, 2012; Wu et al., 2013; Lin et al., 2013), reducing the size of the
vocabulary in terms of tag occurrence is an important prerequisite to keep 
the matrix in a manageable scale. Observing that images tagged in a batch 
manner are often nearly duplicate and of low tagging quality, batch-tagged 
images are excluded in (Li et al., 2012). Since relevant tags may be missing 
from user annotations, the negative tags that are semantically similar or 
co-occurring with positive ones are discarded in (Sang, Xu and Liu, 2012). 
As the above strategies do not take the visual content into account, they 
cannot handle situations where an image is incorrectly labeled with a valid 
and frequently used tag, say ‘dog’. In (Li et al., 2009a), tag relevance scores 
are assigned to each image in S by running the neighbor voting algorithm 
(Li et al., 2009b), while in (Li and Snoek, 2013), the semantic field algorithm
(Zhu et al., 2012) is further added to select relevant training examples. In 
(Qian et al., 2015), the annotation of the training media is enriched by a 
random walk. 
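
The two most common content-agnostic filters, a lexicon check and an
occurrence threshold, can be sketched together. The lexicon argument is a
hypothetical stand-in for a WordNet or Wikipedia lookup, and the function
name and threshold are illustrative.

```python
from collections import Counter


def filter_tags(image_tags, lexicon, min_count):
    """Keep tags that are in the lexicon and occur in at least min_count images."""
    # Count each tag at most once per image.
    counts = Counter(t for tags in image_tags for t in set(tags))
    keep = {t for t, c in counts.items() if c >= min_count and t in lexicon}
    return [[t for t in tags if t in keep] for tags in image_tags]
```

As noted above, such a filter cannot catch a valid and frequent tag that is
attached to the wrong image, since it never looks at the visual content.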


Precompute. The precompute component is responsible for the generation
of the prior information that is jointly used with the refined training
media S in learning. For instance, global statistics and external resources
can be used to synthesize new prior knowledge useful in learning. The prior
information commonly used is tag statistics in S, including tag occurrence
and tag co-occurrence. Tag occurrence is used in (Li et al., 2009b) as a
penalty to suppress overly frequent tags. Measuring the semantic similarity
between two tags is important for tag relevance learning algorithms
that exploit tag correlations. While linguistic metrics such as those derived
from WordNet were used before the proliferation of social media (Jin et al.,
2005; Wang et al., 2006), they do not directly reflect how people tag images.
For instance, tag ‘sunset’ and tag ‘sea’ are weakly related according to the
WordNet ontology, but they often appear together in social tagging as many of
the sunset photos are shot at the seaside. Therefore, similarity measures
that are based on tag statistics computed from many socially tagged images
are in dominant use. Sigurbjörnsson and van Zwol utilized the Jaccard
coefficient and a conditional tag probability in their tag suggestion system
(Sigurbjörnsson and van Zwol, 2008), while Liu et al. used normalized tag
co-occurrence (Liu et al., 2013). To better capture the visual relationship
between two tags, Wu et al. proposed the Flickr distance (Wu et al., 2008).
The authors represent each tag by a visual language model, trained on
bag-of-visual-words features of images labeled with this tag. The Flickr
distance between two tags is computed as the Jensen-Shannon divergence between
the corresponding models. Later, Jiang et al. introduced the Flickr context
similarity, which also captures the visual relationship between two tags, but
without the need for expensive visual modeling (Jiang et al., 2009). The
trick is to compute the Normalized Google Distance (Cilibrasi and Vitanyi,
2004) between two tags, but with tag statistics acquired from Flickr image
collections instead of Google indexed web pages. For its simplicity and
effectiveness, we observe a prevalent use of the Flickr context similarity in
the literature (Liu, Hua, Yang, Wang and Zhang, 2009; Zhu et al., 2010; Wang
et al., 2010; Zhuang and Hoi, 2011; Zhu et al., 2012; Gao et al., 2013; Li
and Snoek, 2013; Qian et al., 2014).
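
Two of the count-based measures mentioned above can be sketched from tag
occurrence counts alone. The Jaccard coefficient and the Normalized Google
Distance follow their standard definitions; mapping the distance to a
similarity with exp(-NGD) is an assumption here, and the exact form used by
(Jiang et al., 2009) may differ. All function and argument names are
illustrative.

```python
import math


def jaccard(n_x, n_y, n_xy):
    """Jaccard coefficient: images with both tags over images with either tag."""
    return n_xy / (n_x + n_y - n_xy)


def normalized_google_distance(n_x, n_y, n_xy, n_total):
    """NGD computed from tag occurrence counts instead of web page counts."""
    lx, ly, lxy = math.log(n_x), math.log(n_y), math.log(n_xy)
    return (max(lx, ly) - lxy) / (math.log(n_total) - min(lx, ly))


def context_similarity(n_x, n_y, n_xy, n_total):
    """Turn the distance into a similarity; exp(-NGD) is an assumed mapping."""
    if n_xy == 0:
        return 0.0  # tags that never co-occur are treated as unrelated
    return math.exp(-normalized_google_distance(n_x, n_y, n_xy, n_total))
```

Here n_x and n_y are the numbers of images labeled with each tag, n_xy the
number labeled with both, and n_total the size of the collection.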


2.7 Conclusions 


We presented a survey on image tag assignment, refinement and retrieval,
with the hope of illustrating connections and differences between the many
methods and their applicability, and consequently helping the interested
audience to either pick up an existing method or devise a method of their
own given the data at hand. As the topics are being actively studied,
inevitably this survey will miss some papers. Nevertheless, it provides a
unified view of many existing works, and consequently eases the effort of
placing future works in a proper context, both theoretically and
experimentally. Based on the key observation that all works rely on tag
relevance learning as the common ingredient, existing works, which vary in
terms of their methodologies and target tasks, are interpreted in a unified
framework. Consequently, a two-dimensional taxonomy has been developed,
allowing us to structure the growing literature in light of what information
a specific method exploits and how the information is leveraged in order to
produce its tag relevance scores.
Chapter 3
A new Experimental Protocol


In this chapter we propose an evaluation test-bed for the three
linked tasks of Assignment, Refinement and Retrieval. Training
sets of varying sizes and three test datasets are considered
to evaluate methods of varied learning complexity. A selected set
of eleven representative works has been implemented and evaluated.
Several overall patterns are recognized. To highlight the
advantages of socially tagged training sets, an empirical evaluation
between ImageNet and the proposed Flickr-based training sets
is reported.¹


3.1 Introduction 


In spite of the expanding literature, there is a lack of consensus on the 
performance of the individual methods. This is largely due to the fact 
that existing works either use homemade data, see (Liu, Hua, Yang, Wang 
and Zhang, 2009; Wang et al., 2010; Chen et al., 2012; Gao et al., 2013), 
which are not publicly accessible, or use selected subsets of benchmark data, 
e.g. as in (Zhu et al., 2010; Sang, Xu and Liu, 2012; Feng et al., 2014). As 
a consequence, the performance scores reported in the literature are not 
comparable across the papers. 

Benchmark data with manually verified labels is crucial for an objective 
evaluation. As Flickr has been well recognized as a profound manifestation 
of social image tagging, Flickr images act as a main source for benchmark 
construction. MIRFlickr from Leiden University (Huiskes et al., 2010)
and NUS-WIDE from the National University of Singapore (Chua et al.,
2009) are the two most popular Flickr-based benchmark sets for social image
tagging and retrieval, as demonstrated by the number of citations. On the
use of the benchmarks, one typically follows a single-set protocol, that is,
learning the underlying tag relevance function from the training part of a
chosen benchmark set, and evaluating it on the test part. Such a protocol
is inadequate given the dynamic nature of social media, which could easily
make an existing benchmark set outdated. For any method targeting
social images, a cross-set evaluation is necessary to test its generalization
ability, which is however overlooked in the literature.

¹ Parts of this chapter previously appeared in Li, X., Uricchio, T., Ballan, L., Bertini, M.,
Snoek, C. G. and Del Bimbo, A. (2016). "Socializing the semantic gap: A comparative survey on
image tag assignment, refinement, and retrieval". ACM Computing Surveys (CSUR), 49(1), 14.
The publication is available at http://dx.doi.org/10.1145/2906152

Another desirable property is the capability to learn from the increasing 
amounts of socially tagged images. While existing works mostly use training 
data of a fixed scale, this property has not been well evaluated. 

Following these considerations, we present a new experimental proto- 
col, wherein training and test data from distinct research groups are chosen 
for evaluating a number of representative works in the cross-set scenario. 
Training sets with their size ranging from 10k to one million images are con- 
structed to evaluate methods of varied complexity. To the best of our knowl- 
edge, such a comparison between many methods on varied scale datasets 
with a common experimental setup has not been conducted before. For the 
sake of experimental reproducibility, all data and code are made available
online at www.micc.unifi.it/tagsurvey/. 


3.2 Datasets 


We describe the training media S and the test media X as follows, with
basic data characteristics and their usage summarized in Table 3.1. 

Training media S. We use a set of 1.2 million Flickr images collected by 
the University of Amsterdam (Li et al., 2012), by using over 25,000 nouns in 
WordNet as queries to uniformly sample images uploaded between 2006 and 
2010. Based on our observation that batch-tagged images, namely those 
labeled with the same tags by the same user, tend to be near duplicate, 
we have excluded these images beforehand. Other than this, we do not 
perform near-duplicate image removal. To accommodate methods that cannot
handle large data, we created two random subsets from the entire training
set, resulting in three training sets of varied sizes, termed Train10k,
Train100k, and Train1M, respectively.

Test media X. We use MIRFlickr (Huiskes et al., 2010) and NUS-WIDE
(Chua et al., 2009) for tag assignment and refinement, as in (Verbeek et al.,
2010; Zhu et al., 2010; Uricchio et al., 2013) and (Tang et al., 2011; McAuley
and Leskovec, 2012; Zhu et al., 2010; Uricchio et al., 2013) respectively. We
use NUS-WIDE for evaluating tag retrieval as in (Sun et al., 2011; Li, Duan,
Xu and Tsang, 2011). In addition, for retrieval we collected another test set,
namely Flickr51, contributed by Microsoft Research Asia (Wang et al., 2010;
Gao et al., 2013). The MIRFlickr set contains 25,000 images with ground
truth available for 14 tags. The NUS-WIDE set contains 259,233 images,
with ground truth available for 81 tags. The Flickr51 set consists of 81,541
Flickr images with partial ground truth provided for 55 test tags. Among
the 55 tags, there are 4 tags which either have zero occurrence in our training
data or have no correspondence in WordNet, so we ignore them. Differently
from the binary judgments in NUS-WIDE, Flickr51 provides graded relevance,
with 0, 1, and 2 to indicate irrelevant, relevant, and very relevant,
respectively. Moreover, the set contains several ambiguous tags such as
‘apple’ and ‘jaguar’, where relevant instances could exhibit completely
different imagery, e.g., Apple computers versus fruit apples. Following the
original intention of the datasets, we use MIRFlickr and NUS-WIDE for
evaluating tag assignment and tag refinement, and Flickr51 and NUS-WIDE for
tag retrieval. For all the three test sets, we use the full dataset for
testing.


Table 3.1: Our proposed experimental protocol instantiates the Media and Tasks
dimensions of Fig. 2.1 with three training sets and three test sets for tag
assignment, refinement and retrieval. Note that the training sets are socially
tagged; they have no ground truth available for any tag.

                                  Media characteristics                          Tasks
Media                             # images     # tags   # users  # test tags   assignment  refinement  retrieval
Training media S:
Train10k                            10,000     41,253     9,249       -             ✓           ✓          ✓
Train100k                          100,000    214,666    68,215       -             ✓           ✓          ✓
Train1M (Li et al., 2012)        1,198,818  1,127,139   347,369       -             ✓           ✓          ✓
Test media X:
MIRFlickr (Huiskes et al., 2010)    25,000     67,389     9,862      14             ✓           ✓          -
Flickr51 (Wang et al., 2010)        81,541     66,900    20,886      51             -           -          ✓
NUS-WIDE (Chua et al., 2009)       259,233    355,913    51,645      81             ✓           ✓          ✓


Table 3.2: Data overlap between Train1M and the three test sets, measured in
terms of the number of shared images, tags, and users, respectively. Tag
overlap is counted on the top 1,000 most frequent tags. As the original photo
ids of MIRFlickr have been anonymized, we cannot check image overlap between
this dataset and Train1M.

               Overlap with Train1M
Test media     # images   # tags   # users
MIRFlickr          -        693      6,515
Flickr51          130       538     14,211
NUS-WIDE        7,975       718     38,481


Although the training and test media are all from Flickr, they were
collected independently, and consequently they have relatively few
images in common, as shown in Table 3.2.
3.3 Implementation and Evaluation


This section describes common implementations applicable to all the three 
tasks, including the choice of visual features and tag preprocessing. Imple- 
mentations that are applied uniquely to single tasks will be described in the 
coming sections. 


Visual features. Two types of features are extracted to provide insight
into the performance improvement achievable by appropriate feature
selection: the classical bag of visual words (BoVW) and the current state of
the art deep learning based features extracted from Convolutional Neural
Networks (CNN). The BoVW feature is extracted by the color descriptor
software (van de Sande et al., 2010). SIFT descriptors are computed at
densely sampled points, every 6 pixels at two scales. A codebook of size
1,024 is created by K-means clustering. The SIFT descriptors are quantized
by the codebook using hard assignment, and aggregated by sum pooling. In
addition, we extract a compact 64-d global feature (Li, 2007), combining a
44-d color correlogram, a 14-d texture moment, and a 6-d RGB color moment,
to complement the BoVW feature. The CNN feature is extracted by the
pre-trained VGGNet (Simonyan and Zisserman, 2015). In particular, we adopt
the 16-layer VGGNet and take the ReLU activations of the last fully
connected layer as the feature, resulting in a vector of 4,096 dimensions
per image. The BoVW feature is used with the $\ell_1$ distance and the CNN
feature with the cosine distance, as these pairings performed well.
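
To make the quantization and pooling steps concrete, the following minimal
Python sketch hard-assigns each local descriptor to its nearest codeword,
sum-pools the assignments into a histogram, and defines the two distances
mentioned above. It is an illustration only and not the color descriptor
software: the function names and the toy-sized inputs are assumptions, and a
real codebook would hold 1,024 codewords learned by K-means.

```python
import math


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def bovw_histogram(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword, then sum-pool."""
    hist = [0.0] * len(codebook)
    for desc in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda c: squared_euclidean(desc, codebook[c]))
        hist[nearest] += 1.0  # sum pooling: one vote per descriptor
    return hist


def l1_distance(a, b):
    """l1 (Manhattan) distance, used here with the BoVW feature."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity, used here with the CNN feature."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen for this sketch
    return 1.0 - dot / (norm_a * norm_b)
```

With a two-word codebook, three descriptors of which two fall near the first
codeword yield the histogram [2.0, 1.0].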


Vocabulary V. Since the tags a person may use are open-ended, specifying a
tag vocabulary is merely an engineering convenience. For a tag to be
meaningfully modeled, there has to be a reasonable number of training images
labeled with that tag. For methods where tags are processed independently of
each other, the size of the vocabulary has no impact on performance. In the
other cases, in particular for transductive methods that rely on the
image-tag association matrix, the tag dimension has to be constrained to
make the methods runnable. In our case, for these methods a three-step
automatic cleaning procedure is performed on the training datasets. First,
all the tags are lemmatized to their base forms by the NLTK software (Bird
et al., 2009). Second, tags not defined in WordNet are removed. Finally, in
order to avoid insufficient sampling, we remove tags that do not meet a
threshold on tag occurrence. The thresholds are empirically set to 50, 250,
and 750 for Train10k, Train100k, and Train1m, respectively, in order to have
a linear increase in vocabulary size versus a logarithmic increase in the
number of labeled images. This results in a final vocabulary of 237, 419,
and 1,549 tags, respectively, with all the test tags included. Note that
these numbers of tags are larger than the number of tags that can actually
be evaluated. This allows us to build a unified learning method that is more
convenient for cross-dataset evaluation and can exploit inter-tag
relationships.
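
The three cleaning steps can be sketched as follows. This is a minimal
illustration under stated assumptions, not the code used in our experiments:
the lemmatizer and the lexicon are passed in as hypothetical arguments
(`lemmatize`, `known_words`) rather than calling NLTK or WordNet directly,
tags are lowercased first, and a tag is counted at most once per image.

```python
from collections import Counter


def build_vocabulary(image_tags, lemmatize, known_words, min_count):
    """Return the set of tags that survive the three cleaning steps.

    image_tags  : list of tag lists, one per training image
    lemmatize   : callable mapping a tag to its base form (hypothetical)
    known_words : set of words defined in the lexicon (hypothetical)
    min_count   : occurrence threshold, e.g. 50 for a 10k-image set
    """
    counts = Counter()
    for tags in image_tags:
        # Step 1: lemmatize; the set avoids double counting within an image.
        lemmas = {lemmatize(tag.lower()) for tag in tags}
        # Step 2: drop tags that are not defined in the lexicon.
        counts.update(tag for tag in lemmas if tag in known_words)
    # Step 3: drop tags that occur too rarely to be modeled.
    return {tag for tag, count in counts.items() if count >= min_count}
```

Passing the lemmatizer as an argument keeps the sketch runnable without
external corpora; in practice it would wrap whatever lemmatizer is used.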


3.3.1 Evaluating tag assignment 


Evaluation criteria. A good method for tag assignment shall rank relevant 
tags before irrelevant tags for a given test image. Moreover, with the as- 
signed tags, relevant images shall be ranked before irrelevant images for a 
given test tag. We therefore use the image-centric Mean image Average 
Precision (MiAP) to measure the quality of tag ranking, and the tag-centric 
Mean Average Precision (MAP) to measure the quality of image ranking. 
Let $m_{gt}$ be the number of ground-truthed test tags, which is 14 for
MIRFlickr and 81 for NUS-WIDE. The image-centric Average Precision of a
given test image $x$ is computed as

$$ \mathrm{iAP}(x) := \frac{1}{R} \sum_{j=1}^{m_{gt}} \frac{r_j}{j}\, \delta(x, t_j), \qquad (3.1) $$

where $R$ is the number of relevant tags of the given image, $r_j$ is the
number of relevant tags in the top $j$ ranked tags, and $\delta(x, t_j) = 1$
if tag $t_j$ is relevant and 0 otherwise. MiAP is obtained by averaging
iAP(x) over the test images.

The tag-centric Average Precision of a given test tag $t$ is computed as

$$ \mathrm{AP}(t) := \frac{1}{R} \sum_{i=1}^{n} \frac{r_i}{i}\, \delta(x_i, t), \qquad (3.2) $$

where $R$ is the number of relevant images for the given tag, and $r_i$ is
the number of relevant images in the top $i$ ranked images. MAP is obtained
by averaging AP(t) over the test tags.

The two metrics are complementary to some extent. Since MiAP is averaged
over images, each test image contributes equally to MiAP, as opposed to MAP
where each tag contributes equally. Consequently, MiAP is biased towards
frequent tags, while MAP can be easily affected by the performance of rare
tags, especially when $m_{gt}$ is relatively small.

Baseline. Any method targeting tag assignment should be better than a
random guess, which simply returns a random set of tags. The RandomGuess
baseline is obtained by computing MiAP and MAP for the random prediction,
which is run 100 times with the resulting scores averaged.
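
As a worked illustration of the two metrics and of the RandomGuess baseline,
the minimal Python sketch below computes Average Precision from a ranked
list of binary relevance flags. Feeding it one list per image (ranked tags)
gives MiAP, and one list per tag (ranked images) gives MAP. The function
names, the fixed random seed, and the assumption that each list covers all
candidates (so that the number of relevant items in the list equals R) are
ours.

```python
import random


def average_precision(ranked_relevance):
    """AP of a ranked list of 0/1 relevance flags, best-ranked item first."""
    num_relevant = sum(ranked_relevance)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1                 # relevant items seen so far
            total += hits / position  # precision at this relevant position
    return total / num_relevant


def mean_average_precision(rankings):
    """Mean AP over rankings: per image for MiAP, per tag for MAP."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def random_guess_score(relevance_lists, runs=100, seed=0):
    """Average score of random orderings, mimicking the RandomGuess baseline."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = []
        for flags in relevance_lists:
            permuted = list(flags)
            rng.shuffle(permuted)  # a random ranking of the candidates
            shuffled.append(permuted)
        scores.append(mean_average_precision(shuffled))
    return sum(scores) / runs
```

For example, the ranking [1, 0, 1, 0] has precision 1/1 and 2/3 at its two
relevant positions, so its AP is (1 + 2/3) / 2.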


3.3.2 Evaluating tag refinement 


Evaluation criteria. As tag refinement is also meant to improve tag
ranking and image ranking, it is evaluated by the same criteria as tag
assignment, i.e., MiAP and MAP.

Baseline. A natural baseline for tag refinement is the original user tags
assigned to an image, which we term UserTags.


3.3.3 Evaluating tag retrieval 


Evaluation criteria. To compare methods for tag retrieval, for each test tag 
we first conduct tag-based image search to retrieve images labeled with that 
tag, and then sort the images by the tag relevance scores. We use MAP 
to measure the quality of the entire image ranking. As users often look at 
the top ranked results and hardly go through the entire list, we also report 
Normalized Discounted Cumulative Gain (NDCG), commonly used to evaluate
the top few ranked results of an information retrieval system (Järvelin
and Kekäläinen, 2002). Given a test tag $t$, its NDCG at a particular rank
position $h$ is defined as

$$ \mathrm{NDCG}_h(t) := \frac{\mathrm{DCG}_h(t)}{\mathrm{IDCG}_h(t)}, \qquad (3.3) $$

$$ \mathrm{DCG}_h(t) = \sum_{i=1}^{h} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad (3.4) $$

where $rel_i$ is the graded relevance of the result at position $i$, and
$\mathrm{IDCG}_h$ is the maximum possible DCG up to position $h$. We set $h$
to 20, which corresponds to a typical number of search results presented on
the first two pages of a web search engine. Similar to MAP, the NDCG$_h$ of
a specific method on a specific test set is averaged over the test tags of
that test set.
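
Eqs. (3.3) and (3.4) translate directly into code. The sketch below is a
minimal illustration; the function names are ours, and the ideal DCG is
derived by sorting the same list of graded relevances, which assumes the
list holds all candidate images for the tag.

```python
import math


def dcg_at(relevances, h):
    """Discounted cumulative gain of the top-h graded relevances, Eq. (3.4)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:h], start=1))


def ndcg_at(relevances, h=20):
    """NDCG at rank h, Eq. (3.3); relevances are listed in ranked order."""
    ideal = dcg_at(sorted(relevances, reverse=True), h)
    return dcg_at(relevances, h) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list such as [2, 1, 0] scores 1, while placing the only
very relevant image second, as in [0, 2], scores 1 / log2(3).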

Baselines. When searching for relevant images for a given tag, it is
natural to ask how much a specific method gains compared to a baseline
system which simply returns a random subset of images labeled with that
tag. Similar to the refinement baseline, we also denote this baseline as
UserTags, as both of them purely use the original user tags. For each test
tag, the test images labeled with this tag are sorted at random, and MAP
and NDCG$_{20}$ are computed accordingly. The process is executed 100 times,
and the average score over the 100 runs is reported.

The number of tags per image is often included for image ranking in
previous works (Liu, Hua, Yang, Wang and Zhang, 2009; Xu et al., 2009).
Hence, we build another baseline system, denoted as TagNum, which sorts
images in ascending order by the number of tags per image. The third
baseline, denoted as TagPosition, is from (Sun et al., 2011), where the
relevance score of a tag is determined by its position in the original tag
list uploaded by the user. More precisely, the score is computed as
$1 - position(t)/l$, where $l$ is the number of tags.
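
The two metadata-only baselines are simple enough to sketch in a few lines.
The code below is an illustration with hypothetical function names; it
assumes 1-based tag positions, which the text above does not specify, so
that the last tag of a list receives a score of 0.

```python
def tag_position_score(tags, tag):
    """TagPosition: 1 - position(t)/l, with positions counted from 1."""
    position = tags.index(tag) + 1  # 1-based position (an assumption)
    return 1.0 - position / len(tags)


def tag_num_ranking(images):
    """TagNum: image ids sorted by ascending number of tags.

    images: dict mapping an image id to its list of user tags
    """
    return sorted(images, key=lambda image_id: len(images[image_id]))
```

Under this convention, the first of four tags scores 0.75 and the last one
scores 0.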

3.4 Methods under analysis


Despite the rich literature, most works do not provide code, and an
exhaustive evaluation covering all published methods is impractical. We
have to leave out methods that do not show significant improvements or
novelties w.r.t. the seminal papers in the field, and methods that are
difficult to replicate with the same mathematical precision as intended by
their developers. Our choice is driven by the intention to cover methods
that aim for each of the three tasks, exploiting varied modalities by
distinct learning mechanisms. Eventually we evaluate 11 representative
methods. For each method we analyze its scalability in terms of both
computation and memory. Our analysis leaves out operations that are
independent of specific tags and thus only need to be executed once in an
offline manner, such as visual feature extraction, tag preprocessing, prior
information precomputing, and filtering. The main properties of the methods
are summarized in Table 3.3. Concerning the choice of parameters, we adopt
what the original papers recommend. When no recommendation is given for a
specific method, we try a range of values to the best of our understanding,
and choose the parameters that yield the best overall performance.


3.4.1 SemanticField 


SemanticField (Zhu et al., 2012) measures tag relevance in terms of an
averaged semantic similarity between the tag and the other tags assigned
to the image:

$$ f_{SemField}(x, t) := \frac{1}{l_x} \sum_{i=1}^{l_x} sim(t, t_i), \qquad (3.5) $$

where $\{t_1, \ldots, t_{l_x}\}$ is the list of $l_x$ social tags assigned
to the image $x$, and $sim(t, t_i)$ denotes a semantic similarity between
two tags. SemanticField explicitly assumes that several tags are associated
with the visual data, and their coexistence is accounted for in the
evaluation of tag relevance. Following (Zhu et al., 2012), the similarity is
computed by combining the Flickr context similarity and the WordNet
Wu-Palmer similarity (Wu and Palmer, 1994). The WordNet based similarity
exploits path length in the WordNet hierarchy to infer tag relatedness. We
make a small revision of (Zhu et al., 2012), i.e., combining the two
similarities by averaging instead of multiplication, because averaging
produces slightly better results. SemanticField requires no training except
for computing tag-wise similarity, which can be computed offline and is thus
omitted. With all tag-wise similarities in memory, applying Eq. (3.5)
requires $l_x$ table lookups per tag. Hence, the computational complexity is
$O(m \cdot l_x)$, and the memory complexity is $O(m^2)$.
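
A minimal sketch of Eq. (3.5) with the averaged similarity follows. The two
similarity functions are passed in as hypothetical callables standing in for
the Flickr context similarity and the Wu-Palmer similarity; in a real system
they would be lookups into precomputed tables.

```python
def combined_similarity(flickr_sim, wordnet_sim):
    """Average two tag-similarity functions (our revision of the original)."""
    return lambda a, b: 0.5 * (flickr_sim(a, b) + wordnet_sim(a, b))


def semantic_field(test_tag, image_tags, sim):
    """Eq. (3.5): mean similarity between the test tag and the image's tags."""
    if not image_tags:
        return 0.0  # convention chosen for this sketch
    return sum(sim(test_tag, tag) for tag in image_tags) / len(image_tags)
```

Each call performs one similarity lookup per tag of the image, which is the
$l_x$ lookups per test tag counted in the complexity analysis.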

3.4.2 TagRanking


The tag ranking algorithm (Liu, Hua, Yang, Wang and Zhang, 2009) consists
of two steps. Given an image $x$ and its tags, the first step produces an
initial tag relevance score for each of the tags, obtained by (Gaussian)
kernel density estimation on a set of $\hat{n} = 1{,}000$ images labeled
with each tag, separately. In the second step, a random walk is performed on
a tag graph where the edges are weighted by a tag-wise similarity. We use
the same similarity as in SemanticField. Notice that when applied to tag
retrieval, the algorithm uses the rank of $t$ instead of its score, i.e.,

$$ f_{TagRanking}(x, t) = -rank(t) + \frac{1}{l_x}, \qquad (3.6) $$

where $rank(t)$ returns the rank of $t$ produced by the tag ranking
algorithm. The term $\frac{1}{l_x}$ is a tie-breaker when two images have
the same tag rank. Hence, for a given tag $t$, TagRanking cannot distinguish
relevant images from irrelevant images if $t$ is the sole tag assigned to
them. It explicitly exploits the coexistence of several tags per image.
TagRanking has no learning stage. To derive tag ranks for Eq. (3.6), the
main computation is the kernel density estimation on $\hat{n}$
socially-tagged examples for each tag, followed by an $L$-iteration random
walk on the tag graph of $m$ nodes. All this results in a computation cost
of $O(m \cdot d \cdot \hat{n} + L \cdot m^2)$ per test image. Because the
two steps are executed sequentially, the corresponding memory cost is
$O(\max(d\hat{n}, m^2))$.
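
The retrieval score of Eq. (3.6) can be illustrated with a short sketch. It
assumes the tag ranking step has already produced an ordered tag list for
the image (most relevant first); the function name is hypothetical.

```python
def tag_ranking_retrieval_score(ranked_tags, tag):
    """Eq. (3.6): negative rank of the tag plus the 1/l_x tie-breaker.

    ranked_tags: the image's tags, ordered by the tag ranking algorithm
    """
    rank = ranked_tags.index(tag) + 1  # ranks start at 1
    return -rank + 1.0 / len(ranked_tags)
```

An image whose only tag is the query tag always scores 0, whatever its
content, which is why such images cannot be told apart by this method.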


3.4.3 KNN 


This algorithm (Makadia et al., 2010) estimates the relevance of a given
tag with respect to an image by first retrieving $k$ nearest neighbors from
$S$ based on a visual distance $d$, and then counting the tag occurrence in
the associated tags of the neighborhood. In particular, KNN builds
$f_\Phi(x, t; \Theta)$ as:

$$ f_{KNN}(x, t) := k_t, \qquad (3.7) $$

where $k_t$ is the number of images with $t$ in the visual neighborhood of
$x$. The instance-based KNN requires no training. The main computation of
$f_{KNN}$ is to find $k$ nearest neighbors from $S$, which has a complexity
of $O(d \cdot |S| + k \cdot \log |S|)$ per test image, and a memory
footprint of $O(d \cdot |S|)$ to store all the $d$-dimensional feature
vectors. It is worth noting that these complexities are drawn from a
straightforward implementation of k-nn search, and can be substantially
reduced by employing more efficient search techniques, cf. (Jégou et al.,
2011). Accelerating KNN by the product quantization technique (Jégou et al.,
2011) imposes an extra training step, where one has to construct multiple
vector quantizers by K-means clustering, and further use the quantizers to
compress the original feature vector into a few codes.
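
A straightforward, non-accelerated version of Eq. (3.7) is sketched below.
It is an illustration with hypothetical names, using an exhaustive scan of
the training set and the $\ell_1$ distance as an example distance.

```python
import heapq
from collections import Counter


def l1_distance(a, b):
    """l1 distance between two equal-length feature vectors."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def knn_tag_relevance(query, training_set, k, distance=l1_distance):
    """Eq. (3.7): count how many of the k nearest neighbors carry each tag.

    training_set: list of (feature_vector, tags) pairs
    Returns a Counter mapping each tag t to k_t.
    """
    neighbors = heapq.nsmallest(
        k, training_set, key=lambda item: distance(query, item[0]))
    votes = Counter()
    for _, tags in neighbors:
        votes.update(set(tags))  # each neighbor votes once per tag
    return votes
```

Tags absent from the neighborhood simply get a count of 0 when looked up in
the returned counter.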


3.4.4 TagVote 


The TagVote (Li et al., 2009b) algorithm estimates the relevance of a tag
$t$ w.r.t. an image $x$ by counting the occurrence frequency of $t$ in the
social annotations of the visual neighbors of $x$. Unlike KNN, TagVote
exploits the user element in the social framework and introduces a
unique-user constraint on the neighbor set to make the voting result more
objective: each user has at most one image in the neighbor set. Moreover,
TagVote also takes into account tag prior frequency to suppress overly
frequent tags. In particular, the TagVote algorithm builds
$f_\Phi(x, t; \Theta)$ as

$$ f_{TagVote}(x, t) := k_t - k \frac{n_t}{|S|}, \qquad (3.8) $$

where $n_t$ is the number of images labeled with $t$ in $S$. Following (Li
et al., 2009b), we set $k$ to 1,000 for both KNN and TagVote. TagVote has
the same order of complexity as KNN.
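
The two ingredients that separate TagVote from KNN, the unique-user
constraint and the prior-based penalty of Eq. (3.8), can be sketched as
follows. The data layout and function names are assumptions made for the
example, and the full sort stands in for a proper nearest-neighbor search.

```python
from collections import Counter


def l1_distance(a, b):
    """l1 distance between two equal-length feature vectors."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def tag_vote(query, training_set, k, distance=l1_distance):
    """Eq. (3.8): neighbor votes minus the expected votes under the tag prior.

    training_set: list of (feature_vector, tags, user_id) triples
    Returns a dict mapping each training tag t to k_t - k * n_t / |S|.
    """
    ranked = sorted(training_set, key=lambda item: distance(query, item[0]))
    neighbor_tags, seen_users = [], set()
    for _, tags, user in ranked:
        if user in seen_users:
            continue  # unique-user constraint: one image per user
        seen_users.add(user)
        neighbor_tags.append(tags)
        if len(neighbor_tags) == k:
            break
    votes = Counter(t for tags in neighbor_tags for t in set(tags))
    prior = Counter(t for _, tags, _ in training_set for t in set(tags))
    size = len(training_set)
    return {t: votes[t] - k * prior[t] / size for t in prior}
```

In the toy case used for testing, a frequent tag with one vote ends up below
a rare tag with one vote, which is the intended effect of the penalty.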


3.4.5 TagProp 


TagProp (Guillaumin et al., 2009; Verbeek et al., 2010) employs neighbor
voting plus distance metric learning. A probabilistic framework is proposed
where the probability of using images in the neighborhood is defined based
on rank or distance-based weights. TagProp builds $f_\Phi(x, t; \Theta)$ as:

$$ f_{TagProp}(x, t) := \sum_{j=1}^{k} \pi_j \cdot I(x_j, t), \qquad (3.9) $$

where $\pi_j$ is a non-negative weight indicating the importance of the
$j$-th neighbor $x_j$, and $I(x_j, t)$ returns 1 if $x_j$ is labeled with
$t$, and 0 otherwise. Following (Verbeek et al., 2010), we use
$k = 1{,}000$ and the rank-based weights, which showed similar performance
to the distance-based weights. Unlike TagVote, which uses the tag prior to
penalize frequent tags, TagProp promotes rare tags and penalizes frequent
ones by training a logistic model per tag upon $f_{TagProp}(x, t)$. The use
of the logistic model makes TagProp a model-based method. In contrast to KNN
and TagVote wherein visual neighbors are treated equally, TagProp employs
distance metric learning to re-weight the neighbors, yielding a learning
complexity of $O(l \cdot m \cdot k)$, where $l$ is the number of gradient
descent iterations it needs (typically fewer than 10). TagProp maintains
$2m$ extra parameters for the logistic models, though their storage cost is
negligible compared to the visual features. Therefore, running Eq. (3.9) has
the same order of complexity as KNN and TagVote.
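
The weighted vote of Eq. (3.9) and the per-tag logistic model can be
sketched as below. The weights are taken as given rather than learned, and
the parameter names `alpha` and `beta` for the two per-tag logistic
parameters are our own labels, so this is an illustration of the scoring
step only.

```python
import math


def tagprop_score(neighbor_tags, weights, tag):
    """Eq. (3.9): sum of the weights of the neighbors labeled with the tag.

    neighbor_tags: tag sets of the k neighbors, in rank order
    weights      : non-negative weights pi_j, one per neighbor rank
    """
    return sum(w for w, tags in zip(weights, neighbor_tags) if tag in tags)


def logistic(score, alpha, beta):
    """Per-tag logistic model applied on top of the weighted vote."""
    return 1.0 / (1.0 + math.exp(-(alpha * score + beta)))
```

When `alpha` is positive the logistic output is monotonic in the weighted
vote, so it rescales scores across tags without changing how images are
ranked for a single tag.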


3.4.6 TagCooccur 


While both SemanticField and TagCooccur are tag-based, the main difference
lies in how they compute the contribution of a specific tag to the test
tag's relevance score. Unlike SemanticField, which uses tag similarities,
TagCooccur (Sigurbjörnsson and van Zwol, 2008) uses the test tag's rank in
the tag ranking list created by sorting all tags in terms of their
co-occurrence frequency with the tag in a social framework. In addition,
TagCooccur takes into account the stability of the tag, measured by its
frequency. The method is implemented as

$$ f_{TagCooccur}(x, t) = \mathrm{descript}(t) \sum_{i=1}^{l_x} \mathrm{vote}(t_i, t) \cdot \mathrm{rank\text{-}promo}(t_i, t) \cdot \mathrm{stability}(t_i), \qquad (3.10) $$

where descript(t) damps the contribution of tags with a very high
frequency, rank-promo($t_i$, $t$) measures the rank-based contribution of
$t_i$ to $t$, stability($t_i$) promotes tags for which the statistics are
more stable, and vote($t_i$, $t$) is 1 if $t$ is among the top 25 ranked
tags of $t_i$, and 0 otherwise. TagCooccur has the same order of complexity
as SemanticField.


3.4.7 TagCooccur+ 


TagCooccur+ (Li et al., 2009b) is proposed to improve TagCooccur by
adding the visual content. This is achieved by multiplying
$f_{TagCooccur}(x, t)$ with a content-based term, i.e.,

$$ f_{TagCooccur+}(x, t) = f_{TagCooccur}(x, t) \cdot \frac{k_c}{k_c + r_c(t) - 1}, \qquad (3.11) $$

where $r_c(t)$ is the rank of $t$ when sorting the vocabulary by
$f_{TagVote}(x, t)$ in descending order, and $k_c$ is a positive weighting
parameter, which is empirically set to 1. While TagCooccur+ is grounded on
TagCooccur and TagVote, the complexity of the former is negligible compared
to the latter, so the complexity of TagCooccur+ is the same as KNN.
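
Eq. (3.11) amounts to damping each co-occurrence score by the tag's visual
rank. The sketch below is illustrative: it takes precomputed TagCooccur and
TagVote scores as plain dictionaries, a layout we assume for the example.

```python
def tag_cooccur_plus(cooccur_scores, tagvote_scores, k_c=1.0):
    """Eq. (3.11): TagCooccur score times k_c / (k_c + r_c(t) - 1).

    cooccur_scores: dict tag -> TagCooccur score for the image
    tagvote_scores: dict tag -> TagVote score for the same image
    """
    ranked = sorted(tagvote_scores, key=tagvote_scores.get, reverse=True)
    rank = {tag: r for r, tag in enumerate(ranked, start=1)}
    return {tag: cooccur_scores.get(tag, 0.0) * k_c / (k_c + rank[tag] - 1)
            for tag in ranked}
```

With $k_c = 1$ the content-based factor is simply the reciprocal of the
visual rank, so the visually top-ranked tag keeps its co-occurrence score
unchanged.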


3.4.8 TagFeature 


The basic idea of TagFeature (Chen et al., 2012) is to enrich image features
by adding an extra tag feature. It thus relies on the possible presence of
several tags per image in the training set. In particular, a tag vocabulary
that consists of the $d'$ most frequent tags in $S$ is constructed first.
Then, for each tag a two-class linear SVM classifier is trained using
LIBLINEAR (Fan et al., 2008). The positive training set consists of $p$
images labeled with the tag in $S$, and the same number of negative training
examples are randomly sampled from images not labeled with the tag. The
probabilistic output of the classifier, obtained by Platt's scaling (Lin et
al., 2007), corresponds to a specific dimension in the tag feature. By
concatenating the tag and visual features, an augmented feature of $d + d'$
dimensions is obtained. For a test tag $t$, its tag relevance function
$f_{TagFeature}(x, t)$ is obtained by re-training an SVM classifier using
the augmented feature. The linear property of the classifier allows us to
first sum up all the support vectors into a single vector and consequently
to classify a test image by the inner product with this vector. That is,

$$ f_{TagFeature}(x, t) := b + \langle x_t, x \rangle, \qquad (3.12) $$

where $x_t$ is the weighted sum of all support vectors and $b$ the
intercept. To build meaningful classifiers, we use tags that have at least
100 positive examples. While $d'$ is chosen to be 400 in (Chen et al.,
2012), the two smaller training sets, namely Train10k and Train100k, have 76
and 396 tags satisfying the above requirement. We empirically set $p$ to
500, and do a random down-sampling if the number of images for a tag exceeds
this number. For TagFeature, learning a linear classifier for each tag from
$p$ positive and $p$ negative examples requires $O((d + d')p)$ in
computation and $O((d + d')p)$ in memory (Fan et al., 2008). Running
Eq. (3.12) for all the $m$ tags and $n$ images needs $O(nm(d + d'))$ in
computation and $O(m(d + d'))$ in memory.
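
The collapse of a linear SVM into a single weight vector, which is what
makes Eq. (3.12) cheap at test time, can be sketched as follows. The helper
names are hypothetical, and the signed coefficients are assumed to already
include the class labels.

```python
def collapse_support_vectors(support_vectors, coefficients):
    """Weighted sum of the support vectors, i.e. the vector x_t in Eq. (3.12).

    coefficients: one signed weight per support vector
    """
    dim = len(support_vectors[0])
    return [sum(c * sv[i] for c, sv in zip(coefficients, support_vectors))
            for i in range(dim)]


def linear_score(weight_vector, bias, feature):
    """Eq. (3.12): intercept plus inner product with the augmented feature."""
    return bias + sum(w * x for w, x in zip(weight_vector, feature))
```

After the collapse, scoring an image costs one inner product of length
$d + d'$ per tag, independent of the number of support vectors.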


3.4.9 RelExample 


Unlike TagFeature (Chen et al., 2012), which learns from tagged images,
RelExample (Li and Snoek, 2013) exploits positive and negative training
examples which are deemed to be more relevant with respect to the test tag
$t$. In particular, relevant positive examples are selected from $S$ by
combining SemanticField and TagVote in a late fusion manner. For negative
training example acquisition, it leverages Negative Bootstrap (Li, Snoek,
Worring, Koelma and Smeulders, 2013), a negative sampling algorithm which
iteratively selects negative examples deemed most relevant for improving
classification. A $T$-iteration Negative Bootstrap will produce $T$ meta
classifiers. The corresponding tag relevance function is written as

$$ f_{RelExample}(x, t) := \frac{1}{T} \sum_{l=1}^{T} \Big( b_l + \sum_{j=1}^{n_l} \alpha_{l,j} \cdot y_{l,j} \cdot K(x, x_{l,j}) \Big), \qquad (3.13) $$

where $\alpha_{l,j}$ is a positive coefficient of support vector
$x_{l,j}$, $y_{l,j} \in \{-1, 1\}$ is its class label, and $n_l$ the number
of support vectors in the $l$-th classifier. For the sake of efficiency, the
kernel function $K$ is instantiated with the fast intersection kernel (Maji
et al., 2008). RelExample uses the same number of positive training examples
as TagFeature. The number of iterations $T$ is empirically set to 10. For
the SVM classifiers used in TagFeature and RelExample, Platt's scaling (Lin
et al., 2007) is employed to convert prediction scores into probabilistic
output. In RelExample, for each tag, learning a histogram intersection
kernel SVM has a computation cost of $O(dp^2)$ per iteration, and
$O(Tdp^2)$ for $T$ iterations. By jointly using the fast intersection kernel
with a quantization factor of $q$ (Maji et al., 2008) and model compression
(Li, Snoek, Worring, Koelma and Smeulders, 2013), an order of $O(dq)$ is
needed to keep all learned meta classifiers in memory. Since learning a new
classifier needs a memory of $O(dp)$, the overall memory cost for training
RelExample is $O(dp + dq)$. For each tag, model compression is applied to
its learned ensemble before running Eq. (3.13). As a consequence, the
compressed classifier can be cached in an order of $O(dq)$ and executed in
an order of $O(d)$.
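
The uncompressed form of Eq. (3.13) is an average of kernel SVM decision
values. The sketch below illustrates it with the plain histogram
intersection kernel; it does not implement the fast approximation or the
model compression cited above, and the data layout is an assumption.

```python
def intersection_kernel(a, b):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(ai, bi) for ai, bi in zip(a, b))


def ensemble_score(x, classifiers):
    """Eq. (3.13): mean decision value of the T meta classifiers.

    classifiers: list of (bias, terms) pairs, where terms is a list of
                 (alpha, label, support_vector) triples with label in {-1, 1}
    """
    total = 0.0
    for bias, terms in classifiers:
        total += bias + sum(alpha * label * intersection_kernel(x, sv)
                            for alpha, label, sv in terms)
    return total / len(classifiers)
```

Evaluated this way, the cost grows with the total number of support vectors,
which is the motivation for compressing the ensemble before testing.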


3.4.10 RobustPCA 


RobustPCA (Zhu et al., 2010) has been explicitly modeled to deal with
a social framework, including noisy tags and several tags per image. On
the basis of robust principal component analysis (Candès et al., 2011), it
factorizes the image-tag matrix $D$ by a low rank decomposition with error
sparsity. That is,

$$ D = \hat{D} + E, \qquad (3.14) $$

where the reconstructed $\hat{D}$ has a low rank constraint based on the
nuclear norm, and $E$ is an error matrix with an $\ell_1$-norm sparsity
constraint. Notice that the decomposition is not unique. So for a better
solution, the decomposition process takes into account image affinities and
tag affinities, by adding two extra penalties with respect to a Laplacian
matrix $L_i$ from the image affinity graph and another Laplacian matrix
$L_t$ from the tag affinity graph. Consequently, two hyper-parameters
$\lambda_1$ and $\lambda_2$ are introduced to balance the error sparsity and
the two Laplacian strengths. We follow the original paper and set the two
parameters by performing a grid search on the very same proposed range. As
user tags are usually missing, the authors proposed a pre-processing step
where $D$ is reinitialized by a weighted KNN propagation based on the visual
similarity. RobustPCA requires an iterative procedure based on the
accelerated proximal gradient method with a quadratic convergence rate (Zhu
et al., 2010). Each iteration spends the majority of the required time
performing Singular Value Decomposition that, according to (Golub and Van
Loan, 2012), has a well known complexity of $O(cm^2n + c'n^3)$, where $c$
and $c'$ are constants. Regarding memory, it has a requirement of
$O(cn \cdot m + c' \cdot (n^2 + m^2))$ as it needs to process a full copy
of $D$ and the Laplacians of images and labels.
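
In proximal gradient methods, an $\ell_1$ sparsity term such as the one on
$E$ is commonly handled by element-wise soft-thresholding, and a nuclear
norm term by applying the same shrinkage to the singular values. The sketch
below shows only the element-wise operator, as a generic illustration; it is
not the RobustPCA implementation, it omits the SVD step and the Laplacian
penalties, and the threshold `tau` is a hypothetical parameter.

```python
def soft_threshold(value, tau):
    """Shrink a scalar towards zero by tau (proximal operator of tau * |v|)."""
    if value > tau:
        return value - tau
    if value < -tau:
        return value + tau
    return 0.0


def shrink_matrix(matrix, tau):
    """Apply soft-thresholding to every entry of a matrix (list of rows)."""
    return [[soft_threshold(v, tau) for v in row] for row in matrix]
```

Entries smaller in magnitude than `tau` are set exactly to zero, which is
how the sparsity of the error matrix arises.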


3.4.11  TensorAnalysis 


This method (Sang, Xu and Liu, 2012) has been explicitly designed for social
frameworks. It considers ternary relationships between images, tags and
users. User relationships are exploited by extending the image-tag
association matrix to a binary user-image-tag tensor
$F \in \{0,1\}^{|X| \times |V| \times |U|}$. The tensor is factorized by
Tucker decomposition into a dense core $C$ and three low rank matrices $U$,
$I$, $T$, which correspond to the user, image, and tag modalities,
respectively:

$$ F = C \times_u U \times_i I \times_t T. \qquad (3.15) $$

Here $\times_k$ is the tensor product between a tensor and a matrix along
dimension $k \in \{u, i, t\}$. The idea is that $C$ contains the
interactions between modalities, while each low rank matrix represents the
main components of its modality. The rank of every modality has to be set
manually or by energy retention, which adds three parameters
$R = (r_I, r_T, r_U)$. The eventual tag relevance function is obtained after
the optimization process by computing
$\hat{D} = C \times_i I \times_t T \times_u \mathbf{1}_{r_U}$. Similar to
RobustPCA, the decomposition in Eq. (3.15) is not unique, and a better
solution may be found by regularizing the problem with a Laplacian built on
a similarity graph for each modality, i.e., $L_i$, $L_t$, and $L_u$, and an
$\ell_2$ regularizer on each factor, i.e., $C$, $U$, $I$ and $T$. For
TensorAnalysis, the complexity is
$O(|P_1| \cdot (r_T \cdot m^2 + r_U \cdot r_I \cdot r_T))$, proportional to
the number $|P_1|$ of tags asserted in $D$ and to the dimensions $r_U$,
$r_I$, $r_T$ of the low rank factors. The memory required is
$O(n^2 + m^2 + u^2)$ because of the Laplacians of images, tags and users.
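
Once the factors are available, the relevance matrix follows from two
mode products and a contraction of the user mode with an all-ones vector.
The sketch below illustrates this final step only, for already-optimized
factors; the index order of the core (user, image, tag) and the function
name are assumptions made for the example.

```python
def tucker_relevance(core, image_factors, tag_factors):
    """Compute D_hat[i][t] from a Tucker core and the image and tag factors.

    core[a][b][c] : core tensor indexed by (user, image, tag) components
    image_factors : one row of r_I components per image
    tag_factors   : one row of r_T components per tag
    """
    r_u, r_i, r_t = len(core), len(core[0]), len(core[0][0])
    # Contract the user mode with an all-ones vector of length r_U.
    collapsed = [[sum(core[a][b][c] for a in range(r_u))
                  for c in range(r_t)] for b in range(r_i)]
    # Mode products with the image and tag factor matrices.
    return [[sum(img[b] * collapsed[b][c] * tag[c]
                 for b in range(r_i) for c in range(r_t))
             for tag in tag_factors] for img in image_factors]
```

The nested loops are written for clarity; a tensor library would perform the
same contractions far more efficiently.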


3.4.12 Considerations 


An overview of the methods analyzed is given in Table 3.3. Among them,
SemanticField, relying solely on the tag modality, has the best scalability
with respect to both computation and memory. Among the instance-based
methods, TagRanking, which works on selected subsets of $S$ rather than
the entire collection, has the lowest memory requirement. When the number
of tags to be modeled, $m$, is substantially smaller than the size of $S$,
the model-based methods require less memory and run faster in the test
stage, but at the expense of SVM model learning in the training stage. The
two transduction-based methods have limited scalability, and can operate
only on a small-sized $S$.

Table 3.3: Main properties of the eleven methods evaluated in this survey following the
dimensions of Fig. 2.1. The computational and memory complexity of each method is
based on processing n test images and m test tags by exploiting the training set S.
A dash (-) marks cells that are empty in the table.

Methods          Media               Task                    Train Computation  Test Computation                     Train Memory   Test Memory
Instance-based:
SemanticField    tag                 Retrieval               -                  O(n m l_x)                           -              O(m^2)
TagCooccur       tag                 Refinement, Retrieval   -                  O(n m l_x)                           -              O(m^2)
TagRanking       tag + image         Retrieval               -                  O(n(m d n̂ + L m^2))                  -              O(max(d n̂, m^2))
KNN              tag + image         Assignment, Retrieval   -                  O(n(d|S| + k log|S|))                -              O(d|S|)
TagVote          tag + image         Assignment, Retrieval   -                  O(n(d|S| + k log|S|))                -              O(d|S|)
TagCooccur+      tag + image         Refinement, Retrieval   -                  O(n(d|S| + k log|S|))                -              O(d|S|)
Model-based:
TagProp          tag + image         Assignment, Retrieval   O(l m k)           O(n(d|S| + k log|S|))                O(d|S| + 2m)   O(d|S| + 2m)
TagFeature       tag + image         Assignment, Retrieval   O(m(d + d')p)      O(n m (d + d'))                      O((d + d')p)   O(m(d + d'))
RelExample       tag + image         Assignment, Retrieval   O(m T d p^2)       O(n m d)                             O(dp + dq)     O(m d q)
Transduction-based:
RobustPCA        tag + image         Refinement, Retrieval   -                  O(c m^2 n + c' n^3)                  -              O(c n m + c' (n^2 + m^2))
TensorAnalysis   tag + image + user  Refinement              -                  O(|P_1| (r_T m^2 + r_U r_I r_T))     -              O(n^2 + m^2 + u^2)

3.5 Evaluation


This section presents our evaluation of the 11 methods according to their
applicability to the three tasks using the proposed experimental protocol,
that is, KNN, TagVote, TagProp, TagFeature and RelExample for tag
assignment (Section 3.5.1), TagCooccur, TagCooccur+, RobustPCA, and
TensorAnalysis for tag refinement (Section 3.5.2), and all methods for tag
retrieval (Section 3.5.3). For TensorAnalysis we were able to evaluate only
tag refinement with BoVW features on MIRFlickr with Train10k and Train100k.
The reason for this exception is that our implementation of TensorAnalysis
performs worse than the baseline. Consequently, the results of
TensorAnalysis were kindly provided by the authors in the form of tag ranks.
Since the provided tag ranks cannot be converted to image ranks, we could
not compute MAP scores. Finally, a comparison between our Flickr based
training data and ImageNet is given in Section 3.5.4.


3.5.1 Tag assignment 


Table 3.4 shows the tag assignment performance of KNN, TagVote, TagProp,
TagFeature and RelExample. Their superior performance against the
RandomGuess baseline shows that learning purely from social media is
meaningful. TagVote and TagProp are the two best performing methods on
both test sets. Substituting CNN for BoVW consistently brings improvements
for all methods.

In more detail, the following considerations hold. TagProp has higher 
MAP performance than KNN and TagVote in almost all the cases under 
analysis. As discussed in Section 3.4.5, TagProp is built upon KNN, but 
it weights the neighbor images by rank and applies a logistic model per 
tag. Since the logistic model does not affect the image ranking, the superior 
performance of TagProp should be ascribed to rank-based neighbor weight- 
ing. A per-tag comparison on MIRFlickr is given in Fig. 3.1. TagProp is 
almost always ahead of KNN and TagVote. Concerning TagVote and KNN, 
recall that their main difference is that TagVote applies the unique-user 
constraint on the neighborhood and it employs tag prior as a penalty term. 
The fact that the training data contains no batch-tagged images minimizes
the influence of the unique-user constraint. While the penalty term does
not affect image ranking for a given tag, it affects tag ranking for a given
image. This explains why KNN and TagVote have mostly the same MAP but
different MiAP. Also, the consistently higher MiAP of TagVote suggests that
the tag prior based penalty is helpful for tag assignment by neighbor
voting.

We observe that RelExample has a better MAP than TagFeature in every
case. The absence of a filtering component makes TagFeature more likely
to overfit to training examples irrelevant to the test tags. For the other
two model-based methods, the overfitting issue is alleviated by different
strategies: RelExample employs a filtering component to select more
relevant training examples, while TagProp has fewer parameters to tune.

Table 3.4: Evaluating methods for tag assignment. Given the same feature, bold values
indicate top performers on individual test sets.

                               MIRFlickr                        NUS-WIDE
Method                Train10k  Train100k  Train1m    Train10k  Train100k  Train1m
MiAP scores:
RandomGuess           0.147     0.147      0.147      0.061     0.061      0.061
BoVW + KNN            0.232     0.286      0.312      0.171     0.217      0.248
BoVW + TagVote        0.276     0.310      0.328      0.183     0.231      0.259
BoVW + TagProp        0.276     0.299      0.314      0.230     0.249      0.268
BoVW + TagFeature     0.278     0.294      0.298      0.244     0.221      0.214
BoVW + RelExample     0.284     0.309      0.303      0.257     0.233      0.245
CNN + KNN             0.326     0.366      0.379      0.315     0.343      0.376
CNN + TagVote         0.355     0.378      0.389      0.340     0.370      0.396
CNN + TagProp         0.373     0.384      0.392      0.366     0.376      0.380
CNN + TagFeature      0.359     0.378      0.383      0.367     0.338      0.373
CNN + RelExample      0.309     0.385      0.373      0.365     0.354      0.388
MAP scores:
RandomGuess           0.072     0.072      0.072      0.023     0.023      0.023
BoVW + KNN            0.231     0.282      0.336      0.094     0.139      0.185
BoVW + TagVote        0.228     0.280      0.334      0.093     0.137      0.184
BoVW + TagProp        0.245     0.293      0.342      0.102     0.149      0.193
BoVW + TagFeature     0.200     0.199      0.201      0.090     0.096      0.098
BoVW + RelExample     0.284     0.303      0.310      0.119     0.155      0.172
CNN + KNN             0.564     0.613      0.639      0.271     0.356      0.400
CNN + TagVote         0.561     0.613      0.638      0.257     0.358      0.402
CNN + TagProp         0.586     0.619      0.641      0.305     0.376      0.397
CNN + TagFeature      0.444     0.554      0.563      0.262     0.310      0.326
CNN + RelExample      0.538     0.603      0.584      0.300     0.346      0.373

Figure 3.1: Per-tag comparison of methods for tag assignment on MIRFlickr,
trained on Train1m. The colors identify the features used: blue for BoVW, red for
CNN. The test tags have been sorted in descending order by the performance of CNN +
TagProp.


A per-image comparison on NUS-WIDE is given in Fig. 3.2. The test 
images are put into disjoint groups so that images within the same group 
have the same number of ground truth tags. For each group, the area 
of the colored bars is proportional to the number of images on which the 
corresponding methods score best. The first group, i.e., images containing 
only one ground-truth tag, has the most noticeable change as the training set 
grows. There are 75,378 images in this group, and for 39% of the images, 
their single label is ‘person’. When Train1m is used, RelExample beats
KNN, TagVote, and TagProp for this frequent label. This explains the 
leading position of RelExample in the first group. The result also confirms 
our earlier discussion in Section 3.3.1 that MiAP is likely to be biased by 
frequent tags. 


In summary, as long as enough training examples are provided, instance- 
based methods are on par with model-based methods for tag assignment. 
Model-based methods are more suited when the training data is of limited 
availability. However, they are less resilient to noise, and consequently a 
proper filtering strategy for refining the training data becomes essential. 


3.5.2 Tag refinement 


Table 3.5 shows the performance of different methods for tag refinement.
Some entries of the table are missing: RobustPCA could not scale beyond
350k images due to its high demand in both CPU time and memory (see
Table 3.3), while the TensorAnalysis results were provided by the authors
only on MIRFlickr with Train10k, Train100k, and the BoVW feature.




Table 3.5: Evaluating methods for tag refinement. The asterisk (*) indicates results 
provided by the authors of the corresponding methods, while the dash (-) means we were 
unable to produce results. Given the same feature, bold values indicate top performers 
on individual test sets per performance metric. 


                          MIRFlickr                         NUS-WIDE
Method                    Train10k  Train100k  Train1M      Train10k  Train100k  Train1M

MiAP scores:
UserTags                  0.204     0.204      0.204        0.255     0.255      0.255
TagCooccur                0.213     0.242      0.253        0.269     0.305      0.317
BovW + TagCooccur+        0.217     0.262      0.286        0.245     0.297      0.324
BovW + RobustPCA          0.271     0.310      -            0.332     0.323      -
BovW + TensorAnalysis     *0.298    *0.297     -            -         -          -
CNN + TagCooccur+         0.234     0.277      0.310        0.305     0.359      0.387
CNN + RobustPCA           0.368     0.376      -            0.424     0.419      -
CNN + TensorAnalysis      -         -          -            -         -          -

MAP scores:
UserTags                  0.263     0.263      0.263        0.338     0.338      0.338
TagCooccur                0.266     0.298      0.313        0.223     0.321      0.308
BovW + TagCooccur+        0.294     0.343      0.377        0.231     0.345      0.353
BovW + RobustPCA          0.225     0.337      -            0.229     0.234      -
BovW + TensorAnalysis     -         -          -            -         -          -
CNN + TagCooccur+         0.330     0.381      0.420        0.264     0.391      0.406
CNN + RobustPCA           0.566     0.627      -            0.439     0.440      -
CNN + TensorAnalysis      -         -          -            -         -          -

Figure 3.2: Per-image comparison of methods for tag assignment on NUS-WIDE.
Test images are grouped in terms of their number of ground truth tags. The area
of a colored bar is proportional to the number of images that the corresponding method
scores best.


RobustPCA outperforms the competitors on both test sets, when pro- 
vided with the CNN feature. Fig. 3.3 presents a per-tag comparison on 
MIRFlickr. RobustPCA has the best scores for 9 out of the 14 tags with 
BovW, and wins all the tags when CNN is used. 


Concerning the influence of the media dimension, the tag + image based 
methods (RobustPCA and TagCooccur+) are in general better than the tag 
based method (TagCooccur). As shown in Fig. 3.3, except for 3 out of 14 
MIRFlickr test tags with BovW, using the image media is beneficial. As in 
the tag assignment task, the use of the CNN feature strongly improves the 
performance. 


Concerning the learning methods, TensorAnalysis has the potential to 
leverage tag, image, and user simultaneously. However, due to its rela- 
tively poor scalability, we were able to run this method only with Train10k 
and Train100k on MIRFlickr. For Train10k, TensorAnalysis yielded higher 
MiAP than RobustPCA, probably thanks to its capability of modeling user 
correlations. It is outperformed by RobustPCA when more training data is 
used. 


As more training data is used, the performance of TagCooccur, Tag- 
Cooccur+, and RobustPCA on MIRFlickr consistently improves. Since 
these three methods rely on data-driven tag affinity, image affinity, or tag 
and image affinity, a small set of 10k images is generally inadequate to com- 
pute these affinities. The effect of increasing the training set size is clearly 
visible if we compare scores corresponding to Train10k and Train100k. The 
results on NUS-WIDE show some inconsistency. For TagCooccur, MiAP 
improves from Train100k to Train1M, while MAP drops. This is presumably
due to the fact that in the experiments we used the parameters recommended
in the original paper, appropriately selected to optimize tag ranking.
Figure 3.3: Per-tag comparison of methods for tag refinement on MIRFlickr,
trained on Train100k. The colors identify the features used: blue for BovW, red for 
CNN. The test tags have been sorted in descending order by the performance of CNN + 
RobustPCA. 


Hence, they might be suboptimal for image ranking. BovW + RobustPCA 
scores a lower MAP than BovW + TagCooccur+. This is probably due 
to the fact that the low-rank matrix factorization technique, while being 
able to jointly exploit tag and image information, is more sensitive to the 
content-based representation. 
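To illustrate why a data-driven tag affinity needs enough training images, the sketch below estimates a symmetric co-occurrence affinity between tag pairs. It is not the TagCooccur implementation: the use of the Jaccard coefficient, the function name `tag_affinity`, and the toy tag sets are assumptions made only for illustration.

```python
from collections import Counter
from itertools import combinations

def tag_affinity(tagged_images):
    """Jaccard co-occurrence between tag pairs over a collection of tag sets.

    affinity(a, b) = |images with a and b| / |images with a or b|
    """
    freq = Counter()  # number of images per tag
    co = Counter()    # number of images per unordered tag pair
    for tags in tagged_images:
        tags = set(tags)
        freq.update(tags)
        co.update(frozenset(pair) for pair in combinations(sorted(tags), 2))
    return {pair: n / (sum(freq[t] for t in pair) - n) for pair, n in co.items()}

# Toy collection of four tagged images (hypothetical data).
images = [{"dog", "animal"}, {"dog", "animal", "grass"}, {"dog", "park"}, {"grass", "sky"}]
aff = tag_affinity(images)
```

Every estimate rests on raw counts, so pairs that co-occur only once or twice in a small training set receive unreliable affinities, and pairs that never co-occur receive none at all.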


A per-image comparison is given in Fig. 3.4. As for tag assignment, the 
test images have been grouped according to the number of ground truth 
tags associated. The size of the colored areas is proportional to the number 
of images where the corresponding method scores best. For the majority of
test images, the three tag refinement methods have higher average precision
than UserTags. This means more relevant tags are added, so the tags are 
refined. It should be noted that the success of tag refinement depends much 
on the quality of the original tags assigned to the test images. Examples are 
shown in Table 3.7: in row 6, although the tag ‘earthquake’ is irrelevant to 
the image content, it is ranked at the top by RobustPCA. To what extent 
a tag refinement method shall count on the existing tags is tricky. 


To summarize, the tag + image based methods outperform the tag based 
method for tag refinement. RobustPCA is the best, and improves as more 
training data is employed. Nonetheless, implementing RobustPCA is chal- 
lenging for both computation and memory footprint. In contrast, TagCooc- 
cur+ is more scalable and it can learn from large-scale data. 


3.5.3 Tag retrieval 

Tables 3.8 and 3.9 show the performance of different methods for tag re- 
trieval. Recall that when retrieving images for a specific test tag, we con- 
sider only images that are labeled with this tag. Hence, MAP scores here 
are higher than their counterpart in Table 3.5. 
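The effect of restricting the candidate pool can be seen with a small sketch of the two ranking metrics. The definitions below are the common textbook forms of average precision and NDCG for binary relevance; they are an assumption here and may differ in detail from the definitions used earlier in this chapter, and the relevance lists are toy values.

```python
import math

def average_precision(ranked_relevance):
    """AP of a ranked list of binary relevance labels (1 = relevant)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def ndcg_at_k(ranked_relevance, k):
    """NDCG over the top-k positions, assuming binary gains."""
    dcg = sum(rel / math.log2(i + 1)
              for i, rel in enumerate(ranked_relevance[:k], start=1))
    ideal = sorted(ranked_relevance, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Ranking over all test images versus only those labeled with the tag:
ap_all = average_precision([1, 0, 0, 1, 0, 0])
ap_labeled_only = average_precision([1, 0, 1])
```

Dropping unlabeled images removes irrelevant items that would otherwise sit between relevant ones, which is one reason the retrieval scores are not directly comparable with those of the refinement task.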


[Figure 3.4, three panels: Train10k - NUS-WIDE, Train100k - NUS-WIDE, Train1M - NUS-WIDE. Horizontal axis: number of ground truth tags (1 to 13). Vertical axis: number of images with the best AP. Legend: UserTags, TagCooccur, CNN + TagCooccurPlus, CNN + RobustPCA.]

Figure 3.4: Per-image comparison of methods for tag refinement on NUS- 
WIDE. Test images are grouped in terms of their number of ground truth tags. The 
area of a colored bar is proportional to the number of images that the corresponding 
method scores best. 
Table 3.6: Selected tag assignment results on NUS-WIDE. Visual feature: BovW. The
top five ranked tags are shown, with correct prediction marked by the bold italic font.

Image 1. Ground truth: sign. User tags: sign, reptile, zoo, red, white.
  Rank  KNN          TagVote    TagProp    RelExample
  1     animal       dog        sign       soccer
  2     flower       house      street     whale
  3     car          bird       flower     book
  4     horse        sign       dog        toy
  5     street       bear       bird       moon

Image 2. Ground truth: animal, dog, person. User tags: colour, color, dog, hound.
  Rank  KNN          TagVote    TagProp    RelExample
  1     flower       garden     flower     garden
  2     garden       flower     dog        dog
  3     horse        food       garden     fish
  4     tree         cat        car        fox
  5     dog          dog        tree       animal

Image 3. Ground truth: cloud, grass, sky. User tags: cloud, grass.
  Rank  KNN          TagVote    TagProp    RelExample
  1     cloud        cloud      cloud      cloud
  2     sky          sky        sky        ocean
  3     beach        water      beach      surf
  4     water        beach      water      sky
  5     snow         mountain   lake       beach

Image 4. Ground truth: animal, bear, water. User tags: brown, bear, salmon, national, park.
  Rank  KNN          TagVote    TagProp    RelExample
  1     snow         snow       snow       water
  2     beach        animal     beach      sand
  3     animal       waterfall  sand       rock
  4     water        tree       bear       surf
  5     tree         water      water      ocean

Image 5. Ground truth: airplane, cloud, military, sky. User tags: flag, great, [others illegible].
  Rank  KNN          TagVote    TagProp    RelExample
  1     sky          snow       airplane   snow
  2     cloud        cloud      sky        frost
  3     snow         sky        snow       bird
  4     bird         mountain   bird       airplane
  5     airplane     bird       airport    tattoo

Image 6. Ground truth: people. User tags: china, earthquake, hangzhou, summer.
  Rank  KNN          TagVote    TagProp    RelExample
  1     [illegible]  grass      car        house
  2     beach        tree       road       road
  3     water        water      street     grass
  4     street       road       sky        bird
  5     tree         bridge     bird       sand

Image 7. Ground truth: police, road, vehicle, window. User tags: farmer, dog, motorcycle, police, train.
  Rank  KNN          TagVote    TagProp    RelExample
  1     car          car        police     police
  2     street       street     car        vehicle
  3     police       police     street     street
  4     vehicle      vehicle    road       car
  5     road         sport      sport      sport

Table 3.7: Selected tag refinement results on NUS-WIDE. Visual feature: BovW. The top
five ranked tags are shown, with correct prediction marked by the bold italic font.

Image 1. Ground truth: sign. User tags: sign, reptile, zoo, red, white.
  Rank  TagCooccur  TagCooccur+  RobustPCA
  1     animal      sign         sign
  2     street      bird         bird
  3     sign        dog          flower
  4     water       animal       animal
  5     car         toy          street

Image 2. Ground truth: animal, dog, person. User tags: colour, color, dog, hound.
  Rank  TagCooccur  TagCooccur+  RobustPCA
  1     dog         dog          dog
  2     animal      flower       flower
  3     car         animal       animal
  4     beach       cat          water
  5     flower      food         garden

Image 3. Ground truth: cloud, grass, sky. User tags: cloud, grass.
  Rank  TagCooccur  TagCooccur+  RobustPCA
  1     grass       cloud        cloud
  2     sky         sky          grass
  3     tree        water        sky
  4     flower      beach        water
  5     water       tree         mountain

Image 4. Ground truth: animal, bear, water. User tags: brown, bear, salmon, national, park.
  Rank  TagCooccur  TagCooccur+  RobustPCA
  1     waterfall   waterfall    water
  2     water       water        waterfall
  3     tree        animal       bear
  4     bear        snow         animal
  5     animal      tree         snow

Image 5. Ground truth: airplane, cloud, military, sky. User tags: flag, great, [others illegible].
  Rank  TagCooccur  TagCooccur+  RobustPCA
  1     car         snow         flag
  2     street      sky          sky
  3     snow        cloud        snow
  4     water       mountain     cloud
  5     sky         bird         bird

Image 6. Ground truth: people. User tags: china, earthquake, hangzhou, westlake.
  Rank  TagCooccur  TagCooccur+  RobustPCA
  1     water       tree         earthquake
  2     flower      water        water
  3     street      street       tree
  4     temple      garden       cloud
  5     tree        car          sky

Image 7. Ground truth: police, road, vehicle, window. User tags: farmer, dog, motorcycle, police, train.
  Rank  TagCooccur  TagCooccur+  RobustPCA
  1     street      car          police
  2     car         street       train
  3     animal      police       dog
  4     train       food         bird
  5     bird        horse        car

We start our analysis by comparing the three baselines, namely UserTags,
TagNum, and TagPosition, which retrieve images simply by the original
tags. TagNum and TagPosition are more effective than UserTags. TagNum
outperforms TagPosition on Flickr51, while the latter has better scores
on NUS-WIDE. The effectiveness of such metadata-based features depends
heavily on the dataset, so they are unreliable for tag retrieval.


Most of the methods considered have higher MAP than the three baselines.
On Flickr51, they generally outperform the baselines, and performance
increases with the size of the training set. On NUS-WIDE, SemanticField,
TagCooccur, and TagRanking are less effective than TagPosition. We
attribute this result to the fact that, for these methods, the tag
relevance functions favor images with fewer tags. As a result, they show
similar performance and a similar dependency on the dataset.


Concerning the influence of the media dimension, the tag + image based
methods (KNN, TagVote, TagProp, TagCooccur+, TagFeature, RobustPCA,
RelExample) are in general better than the tag based methods
(SemanticField and TagCooccur). Fig. 3.5 shows the per-tag retrieval
performance on Flickr51. For 33 out of the 51 test tags, RelExample exhibits
average precision higher than 0.9. By examining the top retrieved images,
we observe that the results produced by tag + image based methods and tag
based methods are complementary to some extent. For example, consider
‘military’, one of the test tags of NUS-WIDE. RelExample retrieves images
with strong visual patterns such as military vehicles, while SemanticField
returns images of military personnel. Since the visual content is ignored,
the results of SemanticField tend to be visually diverse, which makes it
possible to handle tags with visual ambiguity. This can be observed in
Fig. 3.6, which shows the top 10 ranked images of ‘jaguar’ by TagPosition,
SemanticField, BovW + RelExample, and CNN + RelExample. Although
their results are all correct, RelExample finds jaguar-brand cars only, while
SemanticField covers both cars and animals. However, a complete
evaluation of the capability of managing ambiguous tags requires
fine-grained ground truth beyond what we currently have.


Concerning the learning methods, TagVote consistently performs well,
as in the tag assignment experiment. KNN is comparable to TagVote, for
the reason discussed in Section 3.5.1. Given the CNN feature,
the two methods even outperform their model-based variant TagProp.
Similar to the tag refinement experiment, the effectiveness of RobustPCA for
tag retrieval is sensitive to the choice of visual features. While BovW +
RobustPCA is worse than the majority on Flickr51, CNN + RobustPCA is
more stable and performs well. For TagFeature, the
gain from using larger training data is relatively limited due to the absence
of denoising. In contrast, RelExample, by jointly using SemanticField and
Table 3.8: Evaluating methods for tag retrieval, MAP scores. Given the same feature,
bold values indicate top performers on individual test sets per performance metric.

                          Flickr51                          NUS-WIDE
Method                    Train10k  Train100k  Train1M      Train10k  Train100k  Train1M

MAP scores:
UserTags                  0.595     0.595      0.595        0.489     0.489      0.489
TagNum                    0.664     0.664      0.664        0.520     0.520      0.520
TagPosition               0.640     0.640      0.640        0.557     0.557      0.557
SemanticField             0.687     0.707      0.713        0.565     0.584      0.584
TagCooccur                0.625     0.679      0.704        0.534     0.576      0.588
BovW + TagCooccur+        0.640     0.732      0.764        0.560     0.622      0.643
BovW + TagRanking         0.685     0.686      0.708        0.557     0.574      0.578
BovW + KNN                0.678     0.742      0.770        0.587     0.632      0.658
BovW + TagVote            0.678     0.741      0.769        0.587     0.632      0.659
BovW + TagProp            0.671     0.748      0.772        0.585     0.636      0.657
BovW + TagFeature         0.689     0.726      0.737        0.589     0.602      0.606
BovW + RelExample         0.706     0.756      0.783        0.609     0.645      0.663
BovW + RobustPCA          0.697     0.701      -            0.650     0.650      -
BovW + TensorAnalysis     -         -          -            -         -          -
CNN + TagCooccur+         0.654     0.781      0.821        0.572     0.653      0.674
CNN + TagRanking          0.744     0.735      0.747        0.589     0.590      0.590
CNN + KNN                 0.811     0.859      0.880        0.683     0.722      0.734
CNN + TagVote             0.808     0.859      0.881        0.675     0.724      0.738
CNN + TagProp             0.824     0.867      0.879        0.689     0.727      0.731
CNN + TagFeature          0.827     0.853      0.859        0.675     0.700      0.703
CNN + RelExample          0.838     0.863      0.878        0.689     0.717      0.734
CNN + RobustPCA           0.811     0.839      -            0.725     0.726      -
CNN + TensorAnalysis      -         -          -            -         -          -

Table 3.9: Evaluating methods for tag retrieval, NDCG20 scores. Given the same feature,
bold values indicate top performers on individual test sets per performance metric.

                          Flickr51                          NUS-WIDE
Method                    Train10k  Train100k  Train1M      Train10k  Train100k  Train1M

NDCG20 scores:
UserTags                  0.432     0.432      0.432        0.487     0.487      0.487
TagNum                    0.522     0.522      0.522        0.541     0.541      0.541
TagPosition               0.511     0.511      0.511        0.623     0.623      0.623
SemanticField             0.591     0.623      0.645        0.596     0.622      0.624
TagCooccur                0.482     0.527      0.631        0.529     0.602      0.614
BovW + TagCooccur+        0.503     0.625      0.686        0.590     0.681      0.734
BovW + TagRanking         0.530     0.568      0.571        0.557     0.572      0.572
BovW + KNN                0.577     0.699      0.756        0.638     0.734      0.799
BovW + TagVote            0.573     0.701      0.754        0.629     0.734      0.804
BovW + TagProp            0.570     0.715      0.759        0.666     0.750      0.809
BovW + TagFeature         0.547     0.626      0.646        0.622     0.615      0.618
BovW + RelExample         0.614     0.722      0.748        0.692     0.736      0.776
BovW + RobustPCA          0.549     0.548      -            0.768     0.781      -
BovW + TensorAnalysis     -         -          -            -         -          -
CNN + TagCooccur+         0.504     0.615      0.724        0.571     0.705      0.738
CNN + TagRanking          0.577     0.607      0.597        0.578     0.594      0.583
CNN + KNN                 0.709     0.830      0.897        0.773     0.832      0.863
CNN + TagVote             0.722     0.826      0.899        0.740     0.837      0.879
CNN + TagProp             0.768     0.857      0.865        0.764     0.839      0.845
CNN + TagFeature          0.755     0.813      0.818        0.704     0.807      0.787
CNN + RelExample          0.764     0.843      0.879        0.773     0.814      0.866
CNN + RobustPCA           0.733     0.821      -            0.865     0.862      -
CNN + TensorAnalysis      -         -          -            -         -          -

Figure 3.5: Per-tag comparison between TagPosition, SemanticField, TagVote,
TagProp, and RelExample on Flickr51, with Train1M as the training set. The 51
test tags have been sorted in descending order by the performance of RelExample.


TagVote in its denoising component, is consistently better than TagFeature.

The performance of individual methods consistently improves as more
training data is used. As the size of the training set increases, the
performance gap between the best model-based method (RelExample) and
the best instance-based method (TagVote) narrows. This suggests that
large-scale training data diminishes the advantage of model-based methods
over the relatively simple instance-based methods.

In summary, even though the performance of the methods evaluated
varies over datasets, common patterns have been observed. First, the more
social data are used for training, the better the performance. Since
the tag relevance functions are learned purely from social data without any
extra manual labeling, and social data keep growing, this result
suggests that better tag relevance functions can be learned. Second, given
small-scale training data, tag + image based methods that conduct
model-based learning with denoised training examples turn out to be the most
effective solution. This, however, comes at the price of reducing the visual


Figure 3.6: Top 10 ranked images of ‘jaguar’, by (a) TagPosition, (b)
SemanticField, (c) BovW + RelExample, and (d) CNN + RelExample. Checkmarks
(✓) indicate relevant results. While both RelExample and SemanticField outperform the
TagPosition baseline, the results of SemanticField show more diversity for this ambiguous
tag. The difference between (c) and (d) suggests that the results of RelExample can be
diversified by varying the visual feature in use.


diversity in the retrieval results. Moreover, the advantage of model-based 
learning vanishes as more training data and the CNN feature are used, and 
TagVote performs the best. 


3.5.4 Flickr versus ImageNet 


To address the question of whether one shall resort to an existing resource 
such as ImageNet for tag relevance learning, this section presents an em- 
pirical comparison between our Flickr based training data and ImageNet. 
A number of methods do not work with ImageNet or require modifications. 
For instance, tag + image + user information based methods must be able 
to remove their dependency on user information, as such information is un- 
available in ImageNet. Tag co-occurrences are also strongly limited, because 
an ImageNet example is annotated with a single label. Because of these lim- 
itations, we evaluate only the two best performing methods, TagVote and 
TagProp. TagProp can be directly used since it comes from classic image 
annotation, while TagVote is slightly modified by removing the unique user 
constraint. The CNN feature is used for its superior performance against 
Figure 3.7: Per-image comparison of TagVote/TagProp learned from different
training datasets, tested on NUS-WIDE. Test images are grouped in terms of the 
number of ground truth tags. Within each group, the area of a colored bar is proportional 
to the number of images that (the method derived from) the corresponding training 
dataset scores the best. ImageNet200k is less effective for assigning multiple labels to an 
image. 


the BovW feature. 
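The modification to TagVote mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated here: the score is taken to be the number of neighbor votes minus the votes expected from the tag's prior frequency, Euclidean distance stands in for whatever visual distance is actually used, and the function name `tag_vote` and its `unique_user` switch are hypothetical.

```python
import numpy as np

def tag_vote(query, feats, tags, users, tag, k, unique_user=True):
    """Neighbor-voting relevance of `tag` for a query feature vector.

    score = (# of the k nearest neighbors labeled with `tag`)
            - k * (fraction of training images labeled with `tag`)
    """
    order = np.argsort(np.linalg.norm(feats - query, axis=1))
    neighbors, seen_users = [], set()
    for i in order:
        if unique_user and users[i] in seen_users:
            continue  # unique-user constraint: at most one neighbor per user
        seen_users.add(users[i])
        neighbors.append(i)
        if len(neighbors) == k:
            break
    votes = sum(tag in tags[i] for i in neighbors)
    prior = sum(tag in t for t in tags) / len(tags)
    return votes - len(neighbors) * prior

# Toy training set: two near-duplicate 'dog' images from the same user.
feats = np.array([[0.0], [0.1], [0.2], [5.0]])
tags = [{"dog"}, {"dog"}, {"cat"}, {"cat"}]
users = ["u1", "u1", "u2", "u3"]
query = np.array([0.0])

with_constraint = tag_vote(query, feats, tags, users, "dog", k=2)
without_constraint = tag_vote(query, feats, tags, users, "dog", k=2, unique_user=False)
```

With the constraint, the second image of user `u1` is skipped, so one prolific user cannot cast several votes. Without user information, as in ImageNet, the constraint has to be switched off and every nearest neighbor votes.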


To construct a customized subset of ImageNet that fits the three test
sets, we take ImageNet examples whose labels precisely match the test
tags. Notice that some test tags, e.g., ‘portrait’ and ‘night’, have no match,
while some other tags, e.g., ‘car’ and ‘dog’, have more than one match. In
particular, MIRFlickr has 2 missing tags, while the numbers of missing tags
on Flickr51 and NUS-WIDE are 9 and 15, respectively. For a fair comparison,
these missing tags are excluded from the evaluation. Putting the remaining
test tags together, we obtain a subset of ImageNet, containing 166 labels and
over 200k images, termed ImageNet200k. Likewise, for fairness, we consider
only the Train100k and Train1M training sets of socially tagged images.
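The exact-match selection can be sketched as below. The function name `matched_subset`, the mapping from synset identifiers to label strings, and the identifiers `s1` to `s4` are hypothetical and serve only to show how a test tag can end up with zero, one, or several matching ImageNet labels.

```python
def matched_subset(imagenet_labels, test_tags):
    """Split test tags into those with exactly matching ImageNet labels and those without.

    imagenet_labels: {synset_id: label string}  (assumed layout)
    Returns (kept, missing): kept maps each matched tag to its synset ids.
    """
    matches = {
        tag: [sid for sid, label in imagenet_labels.items() if label == tag]
        for tag in test_tags
    }
    kept = {tag: sids for tag, sids in matches.items() if sids}
    missing = sorted(tag for tag, sids in matches.items() if not sids)
    return kept, missing

# Hypothetical label table, not real ImageNet identifiers.
labels = {"s1": "dog", "s2": "dog", "s3": "car", "s4": "tabby cat"}
kept, missing = matched_subset(labels, ["dog", "car", "night"])
```

Tags in `missing` would be dropped from the evaluation, while a tag with several matching synsets would pool the examples of all of them.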

The left half of Table 3.10 shows the performance of tag assignment.
TagVote/TagProp trained on the ImageNet data are less effective than their
counterparts trained on the Flickr data. For a better understanding of the
result, we employ the same visualization technique as used in Section 3.5.1,
i.e., grouping the test images in terms of the number of their ground truth
tags, and subsequently checking the performance per group. As shown in
Fig. 3.7, while ImageNet200k performs better on the first group, i.e., images
with a single relevant tag, it is outperformed by Train100k and Train1M on
the other groups. Owing to its single-label nature, ImageNet is less effective
for assigning multiple labels to an image.

For tag retrieval, as shown in the right half of Table 3.10, TagVote/TagProp
learned from ImageNet200k in general have higher MAP and NDCG
scores than their counterparts learned from the Flickr data. By comparing
the performance difference per concept, we find that the gain is largely
contributed by a relatively small number of concepts. Consider for instance
Table 3.10: Flickr versus ImageNet. Notice that the numbers on Train100k and Train1M
are different from Tables 3.4, 3.8 and 3.9 due to the use of a reduced set of test tags. Bold
values indicate top performers on a specific test set per performance metric.

Tag Assignment
                  MIRFlickr            NUS-WIDE
Training Set      TagVote  TagProp     TagVote  TagProp

MiAP scores:
Train100k         0.377    0.383       0.392    0.389
Train1M           0.389    0.392       0.414    0.393
ImageNet200k      0.345    0.304       0.325    0.368

MAP scores:
Train100k         0.641    0.647       0.386    0.405
Train1M           0.664    0.668       0.429    0.420
ImageNet200k      0.532    0.532       0.363    0.362

Tag Retrieval
                  Flickr51             NUS-WIDE
Training Set      TagVote  TagProp     TagVote  TagProp

MAP scores:
Train100k         0.854    0.860       0.742    0.745
Train1M           0.874    0.871       0.753    0.745
ImageNet200k      0.873    0.873       0.762    0.762

NDCG20 scores:
Train100k         0.838    0.863       0.849    0.856
Train1M           0.894    0.851       0.891    0.853
ImageNet200k      0.920    0.898       0.843    0.847

TagVote + ImageNet200k and TagVote + Train1M on NUS-WIDE. The
former outperforms the latter for 25 out of the 66 tested concepts.
Sorting the concepts according to their absolute performance gain, the top
three winning concepts of TagVote + ImageNet200k are ‘sand’, ‘garden’,
and ‘rainbow’, with AP gains of 0.391, 0.284, and 0.176, respectively. Here,
the lower performance of TagVote + Train1M is largely due to the
subjectiveness of social tagging. For instance, Flickr images labeled with ‘sand’
tend to be much more diverse, showing a wide range of things visually
irrelevant to sand. Interestingly, the top three losing concepts of TagVote +
ImageNet200k are ‘running’, ‘valley’, and ‘building’, with AP losses of 0.150,
0.107, and 0.090, respectively. For these concepts, we observe that their
ImageNet examples lack diversity. For example, ‘running’ in ImageNet200k mostly
shows a person running on a track. In contrast, the subjectiveness of social
tagging now has a positive effect on generating diverse training examples.
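The per-concept analysis amounts to sorting concepts by the difference in AP between two training sources. A minimal sketch is given below; the function name `rank_concepts_by_gain` is hypothetical, and the AP values are toy numbers with placeholder concept names, not results from Table 3.10.

```python
def rank_concepts_by_gain(ap_a, ap_b, top=3):
    """Return the `top` winning and losing concepts of system A relative to system B.

    ap_a, ap_b: {concept: average precision}; only shared concepts are compared.
    """
    gains = {c: ap_a[c] - ap_b[c] for c in ap_a.keys() & ap_b.keys()}
    ordered = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    return ordered[:top], ordered[::-1][:top]

# Toy AP values for four placeholder concepts.
ap_imagenet = {"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.7}
ap_flickr = {"a": 0.5, "b": 0.6, "c": 0.4, "d": 0.9}
winners, losers = rank_concepts_by_gain(ap_imagenet, ap_flickr, top=2)
```

Inspecting both ends of the sorted list, rather than only the mean, is what reveals whether an overall gain comes from many concepts or from a handful.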

In summary, for tag assignment social media examples are a preferred 
resource of training data. For tag retrieval ImageNet yields better perfor- 
mance, yet the performance gain is largely due to a few tags where social 
tagging is very noisy. In such a case, controlled manual labeling seems indis- 
pensable. In contrast, with clever tag relevance learning algorithms, social 
training data demonstrate competitive or even better performance for many 
of the tested tags. Nevertheless, where the boundary between the two cases 
is precisely located remains unexplored. 


3.6 Conclusions 


Having established the common ground between methods, a new experimental
protocol was introduced for a head-to-head comparison of
state-of-the-art methods. A selected set of eleven representative works was
implemented and evaluated for tag assignment, refinement, and/or retrieval. The
evaluation justifies the state-of-the-art on the three tasks. For tag
assignment, TagProp and TagVote perform best. For tag refinement, RobustPCA
is the choice. For tag retrieval, TagVote achieves the best overall perfor- 
mance. Concerning what media is essential for tag relevance learning, tag + 
image is consistently found to be better than tag alone. While the joint use 
of tag, image, and user information (via TensorAnalysis) demonstrates its 
potential on small-scale datasets, it becomes computationally prohibitive 
as the dataset size increases to 100k and beyond. Comparing the three 
learning strategies, instance-based and model-based methods are found to 
be more reliable and scalable than their transduction-based counterparts. 
As model-based methods are more sensitive to the quality of social image 
tagging, a proper filtering strategy for refining the training media is crucial 
for their success. Despite their leading performance on the small training 
dataset, we find that the performance gain over the instance-based
alternatives diminishes as more training data is used. Finally, the CNN feature
used as a substitute for the BovW feature brings considerable improvements 
for all the tasks. 
Chapter 4
A Cross Modal Approach for Tag Assignment


Tag assignment is still an important open problem in multime- 
dia and computer vision. Many approaches previously proposed 
in the literature do not accurately capture the intricate depen- 
dencies between image content and annotations. We propose a 
learning procedure based on Kernel Canonical Correlation Anal- 
ysis which finds a mapping between visual and textual words by 
projecting them into a latent meaning space. The learned map- 
ping is then used to annotate new images using advanced nearest 
neighbor methods. We evaluate our approach on three popular 
datasets, and show clear improvements over several approaches 
relying on more standard representations.¹


4.1 Introduction 


The exponential growth of media sharing websites, such as Flickr or Picasa,
and social networks such as Facebook, has led to the availability of large
collections of images tagged with human-provided labels. These tags reflect
the image content and can thus be exploited as a loose form of labels and
context. Several researchers have explored ways to use images with
associated labels as a source to build classifiers or to transfer their tags to similar
images (Duygulu et al., 2002; Makadia et al., 2008; Guillaumin et al., 2009;
Li et al., 2009b; Li and Fei-Fei, 2010; Znaidia et al., 2013). Image
annotation is therefore a very active subject of research (Metzler and Manmatha,
2004; Yavlinsky et al., 2005; Carneiro et al., 2007; Liu, Li, Liu, Lu and Ma,
2009; Zhang et al., 2010; Verma and Jawahar, 2012), since search and
indexing over image collections clearly benefit when the collections are
machine enriched with a set of meaningful labels. In this chapter we tackle
the problem of assigning a finite number of relevant tags to an image, given


¹ Parts of the work presented in this chapter have been published in Ballan, L., Uricchio,
T., Seidenari, L., and Del Bimbo, A. (2014, April). “A cross-media model for automatic image
annotation”. In Proceedings of International Conference on Multimedia Retrieval (p. 73). ACM.
The publication is available at http://dx.doi.org/10.1145/2578726.2578728.


Tiberio Uricchio, Image Understanding by Socializing the Semantic Gap, ISBN 978-88-6453-576-0 (print), 
the image appearance and some prior knowledge on the joint distribution
of visual features and tags based on some weakly and noisily annotated data.


The main shortcomings of previous works in the field are twofold. The
first is the aforementioned semantic gap problem, which points to the fact
that it is hard to extract semantically meaningful entities using just low
level visual features. The second shortcoming arises from the fact that
many parametric models, previously presented in the literature, are not
rich enough to accurately capture the intricate dependencies between image
content and annotations. Recently, nearest neighbor based methods have
attracted much attention since they have been found to be quite successful
for tag prediction (Makadia et al., 2008; Guillaumin et al., 2009; Li et al.,
2009b; Uricchio et al., 2013; Znaidia et al., 2013) (see also Chapters 2 and
3). This is mainly due to their flexibility and capacity to adapt to the
patterns in the data as more training data becomes available. The base ingredient
for a vote based tagging algorithm is of course the source of votes: the
set of K nearest neighbors. In challenging real world data it is often the
case that the vote casting neighbors do not contain enough statistics to
obtain reliable predictions. This is mainly due to the fact that certain tags
are much more frequent than others and can cancel out less frequent but
relevant tags (Guillaumin et al., 2009; Li et al., 2009b). All voting
schemes can therefore benefit from a better set of neighbors. We believe that
the main bottleneck in obtaining such an ideal set of neighbors is the semantic
gap. We address this problem using a cross-modal approach to learn a
representation that maximizes the correlation between visual features and
tags in a common semantic subspace.


In Figure 4.1 we show our intuition with an example provided by real 
data. We compare for the same query, a flower close-up, the first thirty-five 
most similar examples provided by the visual features and by our represen- 
tation. The first thing to notice is the large visual and semantic difference 
between the sets of retrieved neighbors by the two approaches. Note also 
that some flower pictures, which we highlight with a dashed red rectangle, 
were not tagged as such. Second, note how the results presented in Figure
4.1(b) contain more and better ranked flower images than those in Figure
4.1(a). Indeed, with the result set in Figure 4.1(a) it is not possible to obtain
a sufficient amount of meaningful neighbors and the correct tag flower is 
canceled by others such as dog or people. 


In this chapter we present a cross-media approach that relies on Kernel 
Canonical Correlation Analysis (KCCA) (Hardoon and Shawe-Taylor, 2003; 
Hardoon et al., 2004) to connect visual and textual modalities through a 
common latent meaning space (called semantic space). Visual features and 
labels are mapped to this space using feature similarities that are observ- 
able inside the respective domains. If mappings are close in this semantic 
Figure 4.1: Nearest neighbors found with baseline representation (a) and with our
proposed method (b) for a flower image (first highlighted in yellow in both figures) from the
MIRFlickr-25K dataset. Training images with ground truth tag flower are highlighted 
with a red border. Nearest neighbors are sorted by decreasing similarity and arranged 
in a matrix using a row-major convention. Dashed red lines indicate flower pictures not 
tagged as such. 
space, the images are likely to be instances of the same underlying
semantic concept. The learned mapping is then used to annotate new images
using a nearest-neighbor voting approach. We present several experiments
using different voting schemes: first, the simple KNN voting of Makadia et
al. (2008), and second, three advanced NN models, namely
TagVote (Li et al., 2009b), TagProp (Guillaumin et al., 2009) and 2PKNN
(Verma and Jawahar, 2012).
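As a rough illustration of the idea, the sketch below solves a regularized KCCA problem in its dual form on toy data and projects both views into the shared space. It is not the implementation used in this chapter: the linear kernels, the regularization value, the synthetic two-view data, and the function names `center_kernel` and `kcca` are all assumptions, and the eigenproblem shown is one common simplified formulation of regularized KCCA.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, reg, dims):
    """Regularized KCCA on two centered n x n kernel matrices.

    Solves (Kx + reg*I)^-1 Ky (Ky + reg*I)^-1 Kx a = lambda^2 a and returns
    dual coefficients (alpha, beta) for the `dims` leading directions.
    """
    n = Kx.shape[0]
    I = np.eye(n)
    Rx = np.linalg.solve(Kx + reg * I, Ky)  # (Kx + reg*I)^-1 Ky
    Ry = np.linalg.solve(Ky + reg * I, Kx)  # (Ky + reg*I)^-1 Kx
    eigvals, eigvecs = np.linalg.eig(Rx @ Ry)
    top = np.argsort(-eigvals.real)[:dims]
    alpha = eigvecs[:, top].real
    beta = Ry @ alpha  # matching directions for the second view, up to scale
    return alpha, beta

# Toy data: a "visual" and a "textual" view sharing one latent factor z.
rng = np.random.default_rng(0)
z = rng.normal(size=60)
X = np.column_stack([z + 0.05 * rng.normal(size=60), rng.normal(size=60)])
Y = np.column_stack([-z + 0.05 * rng.normal(size=60), rng.normal(size=60)])

Kx = center_kernel(X @ X.T)  # linear kernels, chosen only for simplicity
Ky = center_kernel(Y @ Y.T)
alpha, beta = kcca(Kx, Ky, reg=1.0, dims=1)

u, v = Kx @ alpha[:, 0], Ky @ beta[:, 0]  # projections into the shared space
corr = abs(np.corrcoef(u, v)[0, 1])
```

In an annotation setting, a new image would be projected by multiplying its kernel values against the training images with `alpha`, and the nearest-neighbor voting schemes listed above would then operate on distances in that projected space.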


4.1.1 Contribution 


Other existing approaches learn from both words and images, including 
previous uses of CCA (Hardoon and Shawe-Taylor, 2003; Rasiwasia et al., 
2010; Hwang and Grauman, 2012; Gong et al., 2013). In contrast, we are 
the first to propose an approach that combines an effective cross-modal 
representation with advanced nearest-neighbor models for the specific task 
of tag assignment. 

In the following we show that, if combined with advanced NN schemes 
able to deal with the class-imbalance (i.e. large variations in the frequency of 
different labels), our cross-media model achieves high performance without 
requiring heavy computation such as in the case of metric learning frame- 
works with many parameters (as in (Guillaumin et al., 2009; Verma and 
Jawahar, 2012)). 

We present experimental results for two standard datasets, Corel5K
(Duygulu et al., 2002) and IAPR-TC12 (Grubinger et al., 2006), obtaining
highly competitive results. We also report experiments on a challenging
dataset collected from Flickr, the MIRFlickr-25K dataset (Huiskes and
Lew, 2008); our results show that the performance of the proposed method
is boosted even further in the more realistic and interesting scenario
provided by weakly-labeled images.


4.2 Related Work 


In the multimedia and computer vision communities, jointly modeling images
and text has been an active research area in recent years. A first
group of methods uses mixture models to define a joint distribution over 
image features and labels. The training images are used by these mod- 
els as components to define a mixture model over visual features and tags 
(Lavrenko et al., 2003; Feng et al., 2004; Carneiro et al., 2007). They can 
be interpreted as non-parametric density estimators over the co-occurrence 
of images and labels. In another group of methods based on topic models 
(such as LDA and pLSA), each topic represents a distribution over image 
features and labels (Barnard et al., 2003; Monay and Gatica-Perez, 2004).
These kinds of generative models may be criticized because they maximize
the generative data likelihood, which is not optimal for predictive
performance. Another main criticism of these models is their need for
simplifying assumptions in order to make learning and inference tractable.


Discriminative models such as support vector machines have also been 
proposed (Grangier and Bengio, 2008; Verma and Jawahar, 2013). These 
methods learn a classifier for each label, and use them to predict whether a 
test image belongs to the class of images that are annotated with a particular 
label. A main criticism of these works resides in the necessity to define in 
advance the number of labels and to train individual classifiers for each 
of them. This is not feasible in a realistic scenario like the one of web
images.

Despite their simplicity, nearest-neighbor based methods for image
annotation have been found to give state-of-the-art results (Makadia et al.,
2008; Guillaumin et al., 2009; Verma and Jawahar, 2012). The intuition is
that similar images share common labels. The common procedure of the
existing nearest-neighbor methods is to search for a set of visually similar
images and then to select a set of relevant associated tags based on a tag
transfer procedure (Makadia et al., 2008; Li et al., 2009b; Guillaumin et al.,
2009). In all these previous approaches, the similarity is determined using
only visual features of the images.


4.3 Approach 


The proposed method is based on KCCA which provides a common rep- 
resentation for the visual and tag features. We refer to this common rep- 
resentation as semantic space. Similarly to (Hardoon and Shawe-Taylor, 
2003; Hwang and Grauman, 2012) we use KCCA to connect visual and 
textual modalities, but our method is designed to effectively tackle the par- 
ticular problem of image auto-annotation. In Section 4.3.1 we present our 
visual and text features with their respective kernels; next we briefly de- 
scribe KCCA (Section 4.3.2) and the different NN schemes (Section 4.3.3). 
In Figure 4.2 we show an embedding computed with ISOMAP (Tenenbaum 
et al., 2000) of the visual data and its semantic projection. We randomly 
pick three tags to show how the semantic projection that we learn with 
KCCA better suits the actual distribution of tags with respect to the vi- 
sual representation. The semantic projection improves the separation of the 
classes, allowing a better manifold reconstruction and, as our experiments 
will confirm, an improvement in precision and recall on different datasets.


4.3.1 Visual and Tags Views


Visual Feature Representation and Kernels


We directly use the 15 features provided by the authors of (Guillaumin
et al., 2009; Verbeek et al., 2010)². These are different types of global and
local features commonly used for image retrieval and categorization. In
particular we use two types of global descriptors: Gist and color histograms
with 16 bins in each channel for the RGB, LAB and HSV color spaces. Local
features include SIFT and robust hue descriptors, both extracted densely
on a multi-scale grid or for Harris-Laplacian interest points. The local
feature descriptors are quantized using k-means and then all the images
are represented as bag-of-(visual)-words histograms. The histograms are
also computed in a spatial arrangement over three horizontal regions of the
image, and then concatenated to form a new global descriptor that encodes
some information about the global spatial layout.


In this work we use $\chi^2$ exponential kernels for all visual features
$f \in F$:

\[
K_{\chi^2}(h_i, h_j) = \exp\left( -\frac{1}{2A} \sum_{k=1}^{d}
    \frac{\left(h_i(k) - h_j(k)\right)^2}{h_i(k) + h_j(k)} \right), \tag{4.1}
\]

where $A$ is the mean of the $\chi^2$ distances among all the training
examples, $d$ is the dimensionality of a particular feature descriptor and
$h_i$ is its respective histogram representation. It has to be noticed that
all the feature descriptors are L1-normalized. Finally, all the different
visual kernels are averaged to obtain the final visual representation. We
obtain the kernel between two images $I_i$, $I_j$ via kernel averaging:

\[
K_v(I_i, I_j) = \frac{1}{|F|} \sum_{f \in F} K_{\chi^2}(h_i^f, h_j^f). \tag{4.2}
\]
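As a minimal sketch of Eqs. 4.1 and 4.2, the following pure-Python code computes the exponential chi-square kernel for one feature and averages it over several features. The function names, the toy histograms and the choice to skip empty bins are illustrative assumptions, not part of the original implementation.

```python
import math

def chi2_distance(h_i, h_j):
    """Chi-square distance between two L1-normalized histograms."""
    total = 0.0
    for a, b in zip(h_i, h_j):
        if a + b > 0:  # assumption: empty bins are skipped to avoid 0/0
            total += (a - b) ** 2 / (a + b)
    return total

def chi2_kernel(h_i, h_j, mean_dist):
    """Exponential chi-square kernel (Eq. 4.1); mean_dist plays the role of A."""
    return math.exp(-chi2_distance(h_i, h_j) / (2.0 * mean_dist))

def averaged_kernel(feats_i, feats_j, mean_dists):
    """Average of the per-feature kernels (Eq. 4.2)."""
    values = [chi2_kernel(a, b, m)
              for a, b, m in zip(feats_i, feats_j, mean_dists)]
    return sum(values) / len(values)

# Toy example: two images, each described by two histogram features.
img_a = [[0.5, 0.5, 0.0], [0.25, 0.75]]
img_b = [[0.5, 0.0, 0.5], [0.25, 0.75]]
k_ab = averaged_kernel(img_a, img_b, mean_dists=[1.0, 1.0])
```

In practice the normalizer A would be estimated per feature as the mean chi-square distance over the training set; here it is simply fixed to 1.0 for the toy data.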


Tag Feature Representation and Kernel 


We use as tag features the traditional bag-of-words, which records which
labels are named in the image and how many times. Supposing $V$ is our
vocabulary size, i.e. the total number of possible words used for annotation,
each tag-list is mapped to a $V$-dimensional feature vector
$h = [w_1, \cdots, w_V]$, where $w_i$ counts the number of times the $i$-th
word is mentioned in the tag list. In our case this representation is highly
sparse and often counts are simply 0 or 1 values. We use these features to
compute a linear kernel that corresponds to counting the number of tags in
common between two images:

\[
K_t(h_i, h_j) = \langle h_i, h_j \rangle = \sum_{k=1}^{V} h_i(k)\, h_j(k). \tag{4.3}
\]

²These features are available at: http://lear.inrialpes.fr/people/guillaumin/data.php.

Figure 4.2: Visualization of three labels (Corel5K): (a) distribution of image
features in the visual space; (b) distribution of the same images after
projecting into the semantic space learned using KCCA. Note the clearer
distinction of the clusters in the semantic space.
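The bag-of-words tag kernel of Eq. 4.3 can be illustrated with a short sketch. The vocabulary and tag lists below are made-up examples, and the helper names are hypothetical.

```python
def tag_vector(tags, vocabulary):
    """Bag-of-words counts of a tag list over a fixed vocabulary."""
    return [tags.count(word) for word in vocabulary]

def tag_kernel(h_i, h_j):
    """Linear kernel (Eq. 4.3): inner product of two count vectors."""
    return sum(a * b for a, b in zip(h_i, h_j))

vocabulary = ["sky", "sea", "dog", "beach"]
h_a = tag_vector(["sky", "sea", "beach"], vocabulary)
h_b = tag_vector(["sea", "beach", "dog"], vocabulary)
common = tag_kernel(h_a, h_b)  # counts the shared tags "sea" and "beach"
```

With binary counts the kernel value is exactly the number of tags two images share, which is the interpretation given in the text.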


4.3.2 Kernel Canonical Correlation Analysis 


Given two views of the data, such as the ones provided by visual and textual 
modalities, we can construct a common representation. Canonical Corre- 
lation Analysis (CCA) seeks to utilize data consisting of paired views to 
simultaneously find projections from each feature space such that the cor- 
relation between the projected representations is maximized. In the liter- 
ature, the CCA method has often been used in cross-language information 
retrieval, where one queries a document in a particular language to retrieve 
relevant documents in another language. In our case, the algorithm learns
two semantic projection bases, one for each modality (i.e. the v view is the
visual cue while the t view is the tag-list cue).

More formally, given $N$ samples from a paired dataset
$\{(v_1, t_1), \ldots, (v_N, t_N)\}$, where $v_i \in \mathbb{R}^n$ and
$t_i \in \mathbb{R}^m$ are the two views of the data, the goal is to
simultaneously find directions $w_v$ and $w_t$ that maximize the correlation
of the projections of $v$ onto $w_v$ and $t$ onto $w_t$. This is expressed as:

\[
\begin{aligned}
w_v^*, w_t^* &= \arg\max_{w_v, w_t}
  \frac{\hat{E}[\langle v, w_v \rangle \langle t, w_t \rangle]}
       {\sqrt{\hat{E}[\langle v, w_v \rangle^2]\,
              \hat{E}[\langle t, w_t \rangle^2]}} \\
&= \arg\max_{w_v, w_t}
  \frac{w_v^T C_{vt} w_t}
       {\sqrt{w_v^T C_{vv} w_v \; w_t^T C_{tt} w_t}},
\end{aligned} \tag{4.4}
\]

where $\hat{E}$ denotes the empirical expectation, $C_{vv}$ and $C_{tt}$
respectively denote the auto-covariance matrices for $v$ and $t$ data, and
$C_{vt}$ denotes the between-sets covariance matrix. The solution can be
found via a generalized eigenvalue problem (Hardoon et al., 2004).

The common CCA algorithm can only recover linear relationships; it is
therefore useful to kernelize it by projecting the data into a
higher-dimensional feature space using the kernel trick. Kernel Canonical
Correlation Analysis (KCCA) is the kernelized version of CCA. To this end,
we define kernel functions over $v$ and $t$ as
$K_v(v_i, v_j) = \phi_v(v_i)^T \phi_v(v_j)$ and
$K_t(t_i, t_j) = \phi_t(t_i)^T \phi_t(t_j)$. Here, the idea is to search for
solutions $w_v$, $w_t$ that lie in the span of the $N$ training instances
$\phi_v(v_i)$ and $\phi_t(t_i)$:

\[
w_v = \sum_i \alpha_i \phi_v(v_i), \qquad
w_t = \sum_i \beta_i \phi_t(t_i), \tag{4.5}
\]

where $i \in \{1, \cdots, N\}$. The objective of KCCA is thus to identify the
weights $\alpha, \beta \in \mathbb{R}^N$ that maximize:

\[
\alpha^*, \beta^* = \arg\max_{\alpha, \beta}
  \frac{\alpha^T K_v K_t \beta}
       {\sqrt{\alpha^T K_v^2 \alpha \; \beta^T K_t^2 \beta}}, \tag{4.6}
\]

where $K_v$ and $K_t$ denote the $N \times N$ kernel matrices over a sample
of $N$ pairs. As shown by Hardoon (Hardoon et al., 2004), learning may need
to be regularized in order to avoid trivial solutions. Hence, we penalize the
norms of the projection vectors and obtain the standard eigenvalue problem:

\[
(K_v + \kappa I)^{-1} K_t (K_t + \kappa I)^{-1} K_v \alpha = \lambda^2 \alpha. \tag{4.7}
\]

The top $D$ eigenvectors of this problem yield the bases
$A = [\alpha^{(1)} \ldots \alpha^{(D)}]$ and
$B = [\beta^{(1)} \ldots \beta^{(D)}]$ that we use to compute the semantic
projections of any vector $v_i$, $t_i$.
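To make the regularized eigenproblem of Eq. 4.7 concrete, here is a minimal NumPy sketch that solves it densely on toy kernels. It is an illustration under stated assumptions: the function name `kcca`, the random toy data and the dense solver are hypothetical choices, and the companion formula for the second basis is the one given by Hardoon et al. (2004). It does not include the Gram matrix approximation used in the actual experiments.

```python
import numpy as np

def kcca(K_v, K_t, kappa=0.1, n_components=2):
    """Solve the regularized KCCA eigenproblem (Eq. 4.7) on dense kernels."""
    n = K_v.shape[0]
    reg = kappa * np.eye(n)
    # M = (K_v + kI)^-1 K_t (K_t + kI)^-1 K_v, built with solves, not inverses.
    M = np.linalg.solve(K_v + reg, K_t) @ np.linalg.solve(K_t + reg, K_v)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(-eigvals.real)[:n_components]
    lam = np.sqrt(np.clip(eigvals.real[order], 0.0, None))
    A = eigvecs[:, order].real
    # Companion basis (assumed from Hardoon et al., 2004):
    # beta = (K_t + kI)^-1 K_v alpha / lambda.
    B = np.linalg.solve(K_t + reg, K_v @ A) / np.maximum(lam, 1e-12)
    return A, B, lam

# Toy paired views: the "tag" view is a linear function of the "visual" view.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Y = X @ rng.normal(size=(5, 3))
K_v, K_t = X @ X.T, Y @ Y.T
A, B, lam = kcca(K_v, K_t, kappa=0.1, n_components=2)
```

The columns of `A` are the top eigenvectors and `lam` holds the corresponding regularized correlations, which stay below one because of the penalty term.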


Implementation Details 


In order to avoid degeneracy with non-invertible Gram matrices and to
increase computational efficiency, we approximate the Gram matrices using
the Partial Gram-Schmidt Orthogonalization (PGSO) algorithm provided
by Hardoon et al. (Hardoon et al., 2004). As suggested in (Hardoon et al.,
2004), the regularization parameter $\kappa$ is found by maximizing the
difference between projections obtained by correctly and randomly paired
views of the data on the training set. In the experiments we have optimized
both the parameters of the PGSO algorithm (i.e. $\kappa$ and $T$); however,
we found the setting $T = 30$ and $\kappa = 0.1$ to be a good starting
configuration. We also found it important to swap the roles of the visual and
textual spaces, as Hardoon (Hardoon et al., 2004) fixes $A$ to be unit
vectors while computing $B$ on the basis of the two kernels.


4.3.3 Tag Assignment Using Nearest Neighbor Models in the
Semantic Space


The intuition underlying the use of nearest-neighbor methods for tag
assignment is that similar images share common labels. Following this key
idea, we have investigated and applied several NN schemes to our semantic
space in order to automatically annotate images. We briefly describe these
models below and refer the interested reader to Chapter 3.

For all baseline methods the $K$ neighbors of a test image $I_i$ are selected
as the training images $I_j$ for which our averaged test kernel value
$K_v(I_i, I_j)$, defined in Eq. 4.2, scores highest. In case the semantic
space projection is used, the $K$ neighbors are computed using:

\[
d(\psi(I_i), \psi(I_j)) = 1 -
  \frac{\psi(I_i)^T \psi(I_j)}{\|\psi(I_i)\|_2 \, \|\psi(I_j)\|_2}, \tag{4.8}
\]

where $\psi(I_i)$ is the semantic projection of a test image $I_i$. The
projection of $I_i$ is defined as $\psi(I_i) = K_v(I_i, \cdot)^T A$, where
$K_v(I_i, \cdot)$ is the vector of kernel values between a sample $I_i$ and
all the training samples. Note that we only use the visual view of our data
both for training and test samples.
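The projection and the distance of Eq. 4.8 can be sketched in a few lines of pure Python. The basis `A`, the kernel rows and the helper names below are toy assumptions chosen only to show the mechanics of neighbor selection in the semantic space.

```python
import math

def project(k_row, A):
    """Semantic projection: kernel values against the N training images
    multiplied by the N x D basis A."""
    dims = len(A[0])
    return [sum(k_row[n] * A[n][d] for n in range(len(k_row)))
            for d in range(dims)]

def cosine_distance(p, q):
    """Eq. 4.8: one minus the cosine similarity of two projections."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm

def nearest_neighbors(test_proj, train_projs, k):
    """Indices of the k training projections closest to test_proj."""
    order = sorted(range(len(train_projs)),
                   key=lambda j: cosine_distance(test_proj, train_projs[j]))
    return order[:k]

# Toy basis with N = 3 training images and D = 2 semantic dimensions.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K_train = [[1.0, 0.1, 0.5], [0.1, 1.0, 0.5], [0.5, 0.5, 1.0]]
train_projs = [project(row, A) for row in K_train]
test_proj = project([0.9, 0.2, 0.5], A)
neighbors = nearest_neighbors(test_proj, train_projs, k=2)
```

Only the visual kernel row of the test image is needed at this stage, which matches the remark that tags are not required for test samples.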


KNN 


Given a test image, we project onto the semantic space and identify its 
K Nearest-Neighbors. Then we merge their labels to create a tag-list by 
counting all tag occurrences on the K retrieved images, and finally we re- 
order the tags by their frequency. If we fix K to a very small number (e.g. 
K = 2) this approach is similar to the ad-hoc nearest neighbor tag transfer 
mechanism proposed by Makadia et al. (Makadia et al., 2008). 
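A minimal sketch of this voting step, assuming the tag lists of the K retrieved neighbors are already available (the function name and toy tags are illustrative):

```python
from collections import Counter

def knn_vote(neighbor_tags, n_tags=5):
    """Merge the tags of the retrieved neighbors and rank them by frequency."""
    counts = Counter(tag for tags in neighbor_tags for tag in tags)
    # most_common sorts by decreasing count; ties keep first-seen order.
    return [tag for tag, _ in counts.most_common(n_tags)]

neighbor_tags = [["sky", "clouds", "sea"], ["sky", "sea"], ["sky", "beach"]]
predicted = knn_vote(neighbor_tags, n_tags=2)
```

Every neighbor contributes equally here, so frequent tags in the collection tend to dominate; the schemes described next try to correct for this.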


TagVote


Li et al. (Li et al., 2009b) proposed a tag relevance measure based on the
consideration that if different persons label visually similar images using
the same tags, then these tags are more likely to reflect objective aspects
of the visual content. Following this idea it can be assumed that, given
a query image, the more frequently a tag occurs in the neighbor set,
the more relevant it might be. However, some frequently occurring tags
are unlikely to be relevant to the majority of images. To account for this
fact the proposed tag relevance measurement takes into account both the
distribution of a tag $t$ in the neighbor set for an image $I$ and in the
entire collection:

\[
\mathrm{tagVote}(t, I, K) := n_t[N(I, K)] - \mathrm{Prior}(t), \tag{4.9}
\]

where $n_t$ is an operator counting the occurrences of $t$ in the
neighborhood $N(I, K)$ of $K$ similar images, and $\mathrm{Prior}(t)$ is the
occurrence frequency of $t$ in the entire collection.
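The following sketch illustrates the measure of Eq. 4.9 on toy data. One assumption is made explicit: the prior is scaled to the expected number of occurrences of the tag among K randomly drawn images, so that it is comparable with the neighbor count; the function name and the data are hypothetical.

```python
def tag_vote(tag, neighbor_tags, collection_tags):
    """Neighbor count of a tag minus its expected count under the prior."""
    k = len(neighbor_tags)
    n_t = sum(1 for tags in neighbor_tags if tag in tags)
    frequency = (sum(1 for tags in collection_tags if tag in tags)
                 / len(collection_tags))
    prior = k * frequency  # assumption: prior expressed as an expected count
    return n_t - prior

# "sky" is frequent in the collection, "sunset" is rare.
collection = [["sky"]] * 4 + [["sky", "sunset"]] + [["dog"]] * 5
neighbors = [["sky", "sunset"], ["sky", "sunset"], ["sky"], ["dog"]]
score_sky = tag_vote("sky", neighbors, collection)        # 3 - 4 * 0.5
score_sunset = tag_vote("sunset", neighbors, collection)  # 2 - 4 * 0.1
```

Although "sky" occurs more often among the neighbors, the rarer tag "sunset" obtains the higher score once the prior is subtracted.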


TagProp 


Guillaumin et al. (Guillaumin et al., 2009) proposed an image annotation
algorithm in which the main idea is to learn a weighted nearest neighbor
model, to automatically find the optimal combination of multiple feature
distances. Using $y_{it} \in \{-1, +1\}$ to represent if tag $t$ is relevant
or not for the test image $I_i$, the probability of being relevant given a
neighborhood of $K$ images $I_j \in N(I_i, K) = \{I_1, I_2, \ldots, I_K\}$ is:

\[
p(y_{it} = +1) = \sum_{I_j \in N(I_i, K)} \pi_{ij}\,
  p(y_{it} = +1 \mid N(I_i, K)), \tag{4.10}
\]

\[
p(y_{it} = +1 \mid N(I_i, K)) =
\begin{cases}
1 - \epsilon & \text{for } y_{jt} = +1, \\
\epsilon & \text{otherwise},
\end{cases} \tag{4.11}
\]

where $\pi_{ij}$ is the weight of a training image $I_j$ of the neighborhood
$N(I_i, K)$ and $p(y_{it} = +1 \mid N(I_i, K))$ is the prediction of tag $t$
according to each neighbor in the weighted sum.

The model can be used with rank-based (RK) or distance-based weighting;
the latter can be learnt by using a single distance (referred to as the
SD variant) or using metric learning (ML) over multiple distances.
Furthermore, to compensate for varying frequencies of tags, a tag-specific
sigmoid is used to scale the predictions, to boost the probability for rare
tags and decrease that of frequent ones. Sigmoids and metric parameters can
be learned by maximizing the log-likelihood $\sum_{i,t} \ln p(y_{it})$.
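The weighted prediction of Eqs. 4.10 and 4.11 can be sketched as follows. The weights are assumed to be distance-based and normalized to sum to one through an exponential of the negative distance; no parameter is learned and the tag-specific sigmoid is omitted, so this is only an illustration of the prediction step with hypothetical names and values.

```python
import math

def tagprop_predict(distances, neighbor_has_tag, eps=1e-5):
    """Weighted-neighbor probability that a tag is relevant (Eqs. 4.10-4.11)."""
    raw = [math.exp(-d) for d in distances]
    total = sum(raw)
    weights = [r / total for r in raw]  # pi_ij, summing to one
    return sum(w * ((1.0 - eps) if has_tag else eps)
               for w, has_tag in zip(weights, neighbor_has_tag))

# Three equally distant neighbors, two of which carry the tag.
p_equal = tagprop_predict([0.0, 0.0, 0.0], [True, True, False], eps=0.0)
# A close neighbor with the tag outweighs a distant one without it.
p_close = tagprop_predict([0.1, 5.0], [True, False])
```

Learning the weights, as described in the text, would amount to adjusting the distance parameters so that the log-likelihood of the training annotations is maximized.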


2PKNN 


Verma and Jawahar (Verma and Jawahar, 2012) proposed a two-phase
method: a first pass addresses the class-imbalance by constructing a
balanced neighborhood for each test image, and a second pass assigns the
actual tag importance based on image similarity.

The problem of image annotation is formulated similarly to Guillaumin
et al. (Guillaumin et al., 2009), by finding the posterior probabilities:

\[
P(t \mid I_i) = \frac{P(I_i \mid t)\, P(t)}{P(I_i)}. \tag{4.13}
\]

Given a test image $I_i$ and a vocabulary $Y = \{t_1, t_2, \ldots, t_m\}$,
the first phase collects a set of neighborhoods $T_{i,t}$, one for each tag
$t \in Y$, by selecting at most the nearest $M$ training images annotated
with $t$. The neighborhood of image $I_i$ is then given by
$N(I_i) = \bigcup_{t \in Y} T_{i,t}$. It should be noticed that a tag can
have fewer than $M$ training images and therefore $N(I_i)$ may still be a
slightly unbalanced set of tags.

In the second phase of 2PKNN, given a tag $t \in Y$, the probability
$P(I_i \mid t)$ is estimated from the neighborhood defined in phase one for
image $I_i$:

\[
P(I_i \mid t) = \sum_{I_j \in N(I_i)} \exp(-D(I_i, I_j))\,
  p(y_{it} = +1 \mid N(I_i)), \tag{4.14}
\]

where $p(y_{it} = +1 \mid N(I_i))$ is the presence of tag $t$ for image $I_i$
as in Guillaumin et al. (Guillaumin et al., 2009) and $D(I_i, I_j)$ is the
distance between images $I_i$ and $I_j$.

In the simplest version of this algorithm $D(I_i, I_j)$ is just a scaled
version of the distance, $w D(I_i, I_j)$, where $w$ is a scalar. The authors
of (Verma and Jawahar, 2012) also propose a more complex version where
$D(I_i, I_j)$ is parameterized as a Mahalanobis distance whose weight matrix
is learned in such a way that the resulting metric pulls the neighbors from
the $T_{i,t}$ belonging to ground-truth tags closer and pushes the remaining
ones away.

(a) Corel5K
         NN-voting          TagVote            TagProp            2PKNN
         Baseline  KCCA     Baseline  KCCA     Baseline  KCCA     Baseline  KCCA
P           26      37         25      36         29      35         36      42
R           30      36         35      37         35      40         38      46
N+         135     139        151     144        144     149        169     179

(b) IAPR-TC12
         NN-voting          TagVote            TagProp            2PKNN
         Baseline  KCCA     Baseline  KCCA     Baseline  KCCA     Baseline  KCCA
P           32      56         27      57         37      58         46      59
R           21      25         26      28         22      26         29      30
N+         235     213        258     246        225     235        272     259

(c) MIRFlickr-25K
         NN-voting          TagVote            TagProp            2PKNN
         Baseline  KCCA     Baseline  KCCA     Baseline  KCCA     Baseline  KCCA
P           34      51         38      50         37      55         16      56
R           26      35         22      37         26      36          6      25
N+          17      18         18      18         18      18         16      18

Table 4.1: This table shows the results of several configurations of our method
based on KCCA and baselines on the Corel5K, IAPR-TC12 and MIRFlickr-25K
datasets.
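The two phases of 2PKNN described in this section (Eqs. 4.13 and 4.14) can be sketched as below, in the simplest variant with a scalar distance weight. The presence term is taken as a 0/1 indicator of the tag on each neighbor, and the function name, data layout and toy values are assumptions made for illustration only.

```python
import math

def two_pass_knn(test_dists, train_tags, vocabulary, m, w=1.0):
    """test_dists[j] is the distance from the test image to training image j."""
    # Phase 1: per tag, keep at most the m nearest images annotated with it.
    neighborhood = set()
    for tag in vocabulary:
        with_tag = [j for j, tags in enumerate(train_tags) if tag in tags]
        with_tag.sort(key=lambda j: test_dists[j])
        neighborhood.update(with_tag[:m])
    # Phase 2: each neighbor votes for its tags with weight exp(-w * distance).
    scores = {}
    for tag in vocabulary:
        scores[tag] = sum(math.exp(-w * test_dists[j])
                          for j in sorted(neighborhood)
                          if tag in train_tags[j])
    return scores

# "sky" is three times more frequent than "dog" in the toy training set.
train_tags = [["sky"], ["sky"], ["sky"], ["dog"]]
test_dists = [0.1, 0.2, 0.3, 0.5]
balanced = two_pass_knn(test_dists, train_tags, ["sky", "dog"], m=1)
unbalanced = two_pass_knn(test_dists, train_tags, ["sky", "dog"], m=3)
```

With m = 1 each tag contributes a single neighbor, so the frequent tag no longer accumulates votes simply because it has more training images.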


4.4 Experiments


We evaluate the performance of our cross-media model for tag assignment 
on three popular datasets and we compare it to closely related work. 


                         Previously reported results                      ML   | Ours
P     16   17   18   24   23   25   30   28   26   28   29   32   39   33   44 |  42
R     19   24   21   25   29   29   33   33   34   35   40   42   40   42   46 |  46
N+   107  112  114  122  137  131  146  140  143  145  157  179  177  160  191 | 179


Table 4.2: This table shows the results of our method and related work on the Corel5K
dataset (as reported in the literature). JEC-15 refers to the JEC (Makadia et al., 2008)
implementation of (Guillaumin et al., 2009) that uses our 15 visual features.


4.4.1 Datasets 


Corel5K. The Corel5K dataset (Duygulu et al., 2002) has been the stan- 
dard evaluation benchmark in the image annotation community for around 
a decade. It contains 5,000 images which are annotated with 260 labels and 
each image has up to 5 different labels (3.4 on average). This dataset is 
divided into 4,500 images for training and 500 images for testing. 


IAPR-TC12. This dataset was introduced in (Grubinger et al., 2006)
for cross-language information retrieval and it consists of 17,665 training
images and 1,962 testing images. Each image is annotated with an average
of 5.7 labels out of 291 candidates.


Figure 4.3: Precision and recall of all the methods on MIRFlickr-25k varying
the number of nearest neighbors. Dashed lines represent baseline methods.
Note that 2PKNN implicitly defines the size of the neighborhood based only on
the number of images per label.


MIRFlickr-25K. The MIRFlickr-25K dataset has been recently introduced
to evaluate keyword-based image retrieval methods. The set contains
25,000 images that were downloaded from Flickr, and for each of these
images the tags originally assigned by the users are available (as well as
EXIF information fields and other metadata such as GPS). It is a very
challenging dataset since the tags are weak labels and not all of them are
actually relevant to the image content. There are also many meaningless
words. Therefore a pre-processing step was performed to filter out these
tags. To this end we matched each tag with entries in WordNet and only
those tags with a corresponding item in WordNet were retained. Moreover,
we removed the less frequent tags, whose occurrence numbers are below 50.
The result of this process is a vocabulary of 219 tags. The images are also
manually annotated for 18 concepts (i.e. labels) that are used to evaluate
the automatic annotation performance. As in (Verbeek et al., 2010), the
dataset is divided into 12,500 images for training and 12,500 images for
testing.
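The tag filtering described for MIRFlickr-25K can be sketched as follows. The `lexicon` set is a hypothetical stand-in for a WordNet lookup, and the toy threshold of 2 replaces the value of 50 used in the text so that the example stays small.

```python
from collections import Counter

def filter_tags(image_tags, lexicon, min_count=50):
    """Keep tags that appear in the lexicon and in at least min_count images."""
    counts = Counter(tag for tags in image_tags for tag in set(tags))
    keep = {tag for tag, count in counts.items()
            if tag in lexicon and count >= min_count}
    filtered = [[tag for tag in tags if tag in keep] for tags in image_tags]
    return filtered, sorted(keep)

# Toy data: "d80" and "nikon" are camera names absent from the lexicon.
image_tags = [["sky", "d80"], ["sky", "nikon"], ["dog", "d80"]]
lexicon = {"sky", "dog"}
filtered, vocabulary = filter_tags(image_tags, lexicon, min_count=2)
```

Images can end up with an empty tag list after filtering, which is one reason why the resulting labels are only weak supervision.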


4.4.2 Evaluation Measures 


We evaluate our models with standard performance measures, used in pre- 
vious work on image annotation. The standard protocol in the field is to 
report Precision and Recall for fixed annotation length (Duygulu et al., 
2002). Thus each image is annotated with the n most relevant labels
(usually, as in this chapter, the results are obtained using n = 5). Then, the
results are reported as mean precision P and mean recall R over the
ground-truth labels; N+ is often used to denote the number of labels with
non-zero recall value. Note that each image is forced to be annotated with n
labels, even if the image has fewer or more labels in the ground truth.
Therefore we will not measure perfect precision and recall figures.
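A small sketch of this protocol is given below. It assumes that per-label precision is set to zero when a label is never predicted, a common convention that the text does not specify; the function name and the toy annotations are illustrative.

```python
def evaluate(predictions, ground_truth, vocabulary):
    """Mean per-label precision and recall, and number of labels with recall > 0."""
    precisions, recalls, n_plus = [], [], 0
    for label in vocabulary:
        predicted = sum(label in p for p in predictions)
        relevant = sum(label in g for g in ground_truth)
        correct = sum(label in p and label in g
                      for p, g in zip(predictions, ground_truth))
        # Assumption: precision is 0 for labels that are never predicted.
        precisions.append(correct / predicted if predicted else 0.0)
        recalls.append(correct / relevant if relevant else 0.0)
        n_plus += int(correct > 0)
    size = len(vocabulary)
    return sum(precisions) / size, sum(recalls) / size, n_plus

# Every test image is annotated with the same n = 2 labels.
ground_truth = [["sky", "sea"], ["sky"], ["dog"]]
predictions = [["sky", "sea"], ["sky", "sea"], ["sky", "sea"]]
mean_p, mean_r, n_plus = evaluate(predictions, ground_truth,
                                  ["sky", "sea", "dog"])
```

Because the second and third images have a single ground-truth label but receive two predictions, precision cannot reach one even for a perfect ranking, which is the effect noted in the text.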


4.4.3 Results 


As a first experiment we compare our method with the corresponding near- 
est neighbor voting schemes. It can be seen from Table 4.1 that our approach 
improves over baseline methods in every setting on all datasets. Precision 
is boosted notably, confirming the better separation of the classes in the 
semantic space (as previously discussed in Section 4.3). Recall is also
improved by a large margin on Corel5K and MIRFlickr-25k. On IAPR-TC12
the recall improvement is less pronounced. We believe this is due to the
different amount of textual annotation: IAPR-TC12 has an average of 5.7 tags
per image (TPI) and up to 23 TPI, while on Corel5K and MIRFlickr-25k the
average TPI is respectively 3.4 and 4.7, with a maximum of 5 and 17 TPI
respectively. Recalling that we are predicting n = 5 tags per image, recall
is harder to improve on this dataset.

We also evaluate how the number of neighbors affects the performance of
both our method and the baselines on the challenging MIRFlickr-25k dataset.
As can be seen from Figure 4.3, the KCCA variants (solid lines) of the four
considered voting schemes systematically improve both precision and recall
for any number of nearest neighbors used. Note that in both cases a similar
pattern emerges, due to the natural instability of NN methods. It is
interesting to note that while recall gets better as the neighborhood gets
bigger, saturating at around 2,000 neighbors, precision depends on the
algorithm chosen. Basic voting and TagVote show an improvement up to
200 neighbors and then begin decreasing; TagProp improves until it saturates
at around 900.

2PKNN lacks a direct parameter to choose the size of the neighborhood,
but it implicitly defines it by choosing at most M images per label.
However, while it has a clear advantage on Corel5K and IAPR-TC12, both
as a baseline and after the projection, it fails to achieve comparable
performance on MIRFlickr-25K. We believe that this is due to the noisy and
missing tags of MIRFlickr-25K, a notable difference of this more realistic
and challenging dataset.

Comparing with the state of the art in Tables 4.2 and 4.3, our method
achieves better performance than all previous works, while it is comparable
with the state-of-the-art method 2PKNN (Verma and Jawahar, 2012)
on Corel5K. Our method performs slightly worse than 2PKNN in its metric
learning configuration. However, metric learning involves a learning
procedure with many parameters, which raises the complexity of the
optimization and undermines scalability.

Once the semantic space has been learned, our method continues to work in
what we call an open world setting. In this setting, which is indeed more
realistic, the number of tags per image evolves over time. That is the case
of big data from social media and, more generally, from the web.

We also report in Table 4.4 a comparison with the methods presented 
in (Guillaumin et al., 2009; Verbeek et al., 2010) using per-image average 
precision (iAP). This measure indicates how well a method identifies rele- 
vant concepts for a given image. Our method combining the 2PKNN voting 
scheme, without metric learning, with the semantic projection outperforms 
all the other methods. 


Qualitative Analysis 


In Figure 4.4 we present some anecdotal evidence for our method (from the
MIRFlickr-25k dataset). It can be seen that TagProp and TagVote perform
better in general, both for the baseline representation and for our proposed
KCCA variant. It has to be noted that for challenging images, where visual
features can be deceiving, our cross-modal approach allows more tags to be
retrieved. As an example see the first two rows: a close-up of a flower and a
cloudy sunset with a road. For the first one it is not surprising that visual
features do not provide enough good neighbors to retrieve the flower tag. For
the second one none of the baseline methods can retrieve the sunset and cloud
tags; we believe that this is due to the lack of color features. In these two
cases it is clear that semantically induced neighbors in the common space can
boost the accuracy.

Another challenging example is shown in row five: a girl is depicted
behind an object that hides a part of the face. This image does not
have enough visual neighbors to retrieve its tags. With our representation
we are able to retrieve girl and portrait in the first three voting schemes and
also people in the TagProp voting scheme, though face and woman may be
considered correct even if not present in the ground truth tags.

Table 4.3: This table shows the results of our method and related work on the
IAPR-TC12 dataset (as reported in the literature).

Table 4.4: This table shows the results of our method and related work (Verbeek
et al., 2010) on the MIRFlickr-25k dataset.


Baselines                                               KCCA models
NN-voting     TagVote       TagProp       2PKNN         NN-voting     TagVote       TagProp       2PKNN

dog           dog           graffiti      graffiti      flower        flower        flower        graffiti
graffiti      graffiti      dog           dog           flowers       flowers       flowers       dog
people        animal        people        people        pink          pink          green         people
black         people        face          face          green         green         pink          face
art           house         art           art           spring        red           white         art

sky           clouds        clouds        clouds        clouds        clouds        clouds        clouds
clouds        sky           sky           sky           sky           sky           sky           sky
water         landscape     water         water         landscape     sunset        landscape     water
landscape     water         landscape     landscape     sunset        landscape     sunset        landscape
trees         trees         trees         trees         blue          cloud         beach         trees

japan         japan         japan         japan         portrait      portrait      portrait      japan
art           zoo           water         water         girl          girl          girl          water
water         dog           dog           dog           tree          woman         green         dog
dog           trees         park          park          street        tree          tree          park
trees         art           art           art           green         trees         trees         art

pink          pink          pink          pink          food          food          food          pink
flower        baby          japan         japan         chocolate     cake          chocolate     japan
japan         japan         flower        flower        cake          chocolate     cake          flower
baby          cake          japanese      japanese      fruit         dog           red           japanese
portrait      crochet       vintage       vintage       red           crochet       fruit         vintage

japan         japan         japan         japan         portrait      portrait      portrait      japan
people        man           people        people        girl          girl          girl          people
man           people        animal        animal        girls         face          face          animal
street        bicycle       kid           kid           hair          woman         people        kid
bicycle       animal        eye           eye           face          hair          woman         eye

street        street        beach         beach         beach         beach         beach         beach
architecture  snow          street        street        sea           sea           sea           street
beach         architecture  people        people        clouds        sunset        clouds        people
white         beach         portrait      portrait      sky           ocean         ocean         portrait
snow          home          landscape     landscape     water         clouds        water         landscape

green         green         green         green         dog           dog           dog           green
garden        waterfall     grass         grass         animal        animal        animal        grass
people        garden        garden        garden        zoo           animals       zoo           garden
flower        bird          feet          feet          green         puppy         dogs          feet
spring        colours       water         water         dogs          dogs          green         water


Figure 4.4: Anecdotal results of the baseline methods and our proposed representation
for a set of challenging images (MIRFlickr-25K dataset), one image per group of five
rows. The tags are ordered by their relevance scores.


4.5 Conclusions 


We presented a cross-media model based on KCCA to perform tag assignment.
We learn semantic projections for both textual and visual data. This
representation is able to provide better neighbors for voting algorithms. The
experimental results show that our method makes consistent improvements
over standard approaches based on a single-view visual representation as
well as other previous work that also exploited tags. We also report
experiments on a challenging dataset collected from Flickr, and our results
show that the performance of the proposed method is boosted even further in a
realistic scenario such as the one provided by weakly-labelled images.
Possible extensions of this work include the exploration of how richer textual
and semantic cues from natural language annotations might improve our
model.


Chapter 5
Evaluating Temporal Information in Social Images 


Can we use the temporal gist of annotations in Web images to
improve tasks such as annotation, indexing and retrieval?
Typically, visual content and text are used to improve these tasks. A
characteristic that has received less attention, so far, is the
temporal aspect of social media production and tagging. This chapter
gives a thorough analysis of the temporal aspects of two popular
datasets commonly used for tasks such as tag ranking, tag
suggestion and tag refinement, namely NUS-WIDE and MIR-Flickr-1M.
The correlation of the time series of the tags with Google
searches shows that for certain concepts web information sources
may be beneficial to annotate social media.¹


5.1 Introduction 


Typically visual content, text and metadata, such as geo-tags, are used 
to improve tasks such as annotation, indexing and retrieval of the huge 
quantities of media produced every day by the users of such systems. For 
instance, visual content similarity is used in (Li et al., 2009b) to perform
tag suggestion and image retrieval, tag co-occurrence has been proposed in
(Sigurbjörnsson and van Zwol, 2008) for tag suggestion, and geo-tags have
been used in (Sizov, 2010) for tag recommendation, content classification and
clustering. A recent review of the state-of-the-art in areas related to web- 
based social communities and social media has been presented in (Sundaram 
et al., 2012), considering in particular the contribution of contextual and 
social aspects of media semantics to multimedia applications. 

A characteristic that has received less attention, so far, is the temporal
aspect of social media production. As noted in (Alonso et al., 2007),
extracting time information from documents may improve several applications
such as hit-list clustering and exploratory search. More recently, several
researchers have shown that the temporal information associated with search
engine queries (e.g. frequency of query keywords over time) can be used to
predict trends and behaviors related to economics and medicine, such as
claims for unemployment benefits (Choi and Varian, 2011) and detection
of flu epidemics (Ginsberg et al., 2009).

[Figure 5.1 plot: soccer (r = 0.46); series: UserTags and GoogleTrends;
marked event: 2006 FIFA World Cup (9 June - 9 July).]

Figure 5.1: Time series of user tags and Google searches for “soccer” in the
NUS-WIDE dataset.

¹Parts of the work presented in this chapter have been published in Uricchio, T., Ballan, L.,
Bertini, M., and Del Bimbo, A. (2013, September). "Evaluating temporal information for social
image annotation and retrieval". In International Conference on Image Analysis and Processing
(pp. 722-732). Springer, Berlin, Heidelberg. The publication is available at
http://dx.doi.org/10.1007/978-3-642-41181-6_73.


In (Rattenbury et al., 2007) “burst” analysis techniques derived from 
signal processing are compared against a novel method to identify social 
events in the associated social media, using the tags and geo-localization 
information of Flickr images. In (Kim et al., 2010), the temporal evolu- 
tion of topics in social image collections is proposed to perform subtopic 
outbreak detection and to classify noisy social images. The authors used a 
non-parametric approach in which images are represented using a similar- 
ity network, created using Sequential Monte Carlo, where images are the 
vertices and the edges connect the temporally related and visually similar 
images. Temporal dynamics of social image collections have been studied in 
(Kim and Xing, 2013) to improve search relevance at query time, address- 
ing both a general case and personalized interest searches. The authors 
propose a unified statistical model based on regularized multi-task regres- 
sion on multivariate point process, in which an image stream is considered 
an instance of a process and a regression problem is formulated to learn the 
relations between image occurrence probabilities and temporal factors that 
influence them (e.g. seasons). 


Analysis of the temporal evolution of social media collections has been
proposed in (Jin et al., 2010) to predict political success and product sales;
regression-based and diffusion-based models have been adapted to account 
for a Flickr-based index, combining images’ metadata and visual similarity, 
that models the popularity of politicians and products. The work presented 
in (Kim et al., 2012) re-casts the problem of image retrieval re-ranking as 
a prediction of which images will be more likely to appear on the web at 
a future time point. Both collective group level and individual user level 
cases are considered, using a multivariate point process to model a stream of 
input images, and using a stochastic parametric model to solve the relations 
between the occurrences of the images and factors such as visual clusters, 
user descriptors and month of the image. 

All the datasets used in these works are based on custom selections of 
user-generated images selected from Flickr, and are not publicly available. 
The main contribution of this chapter is a thorough analysis of the tem- 
poral aspects of two "standard" datasets commonly used for tasks such 
as tag ranking, tag suggestion and tag refinement (Liu, Hua, Yang, Wang 
and Zhang, 2009)(Li et al., 2009b)(Zhu et al., 2010)(Liu, Yan, Hua and 
Zhang, 2011)(Uricchio et al., 2013): NUS-WIDE (Chua et al., 2009) and 
MIR-Flickr-1M (Huiskes et al., 2010). These datasets provide images and 
associated metadata, along with a ground-truth annotation of 81 and 18 
tags, respectively. Analysis of the temporal evolution of both user tags and
ground-truth tags allows us to evaluate the social context (e.g. use of tags
related to the semantics associated with social interaction, and not necessarily
associated with image content) and visual content (e.g. use of tags that are 
more strictly related to image content). The correlation of the time series of 
the tags with Google searches (see Fig. 5.1) shows that for certain concepts 
web information sources may be beneficial to annotate social media. 


5.2 Data Analysis Method 
5.2.1 Datasets 


To measure the impact of temporal information for image annotation pur- 
poses, we performed a quantitative analysis over two image datasets: NUS- 
WIDE (Chua et al., 2009) and MIR-Flickr-1M (Huiskes et al., 2010). 
NUS-WIDE is a large scale dataset collected from Flickr. It contains 
269,648 images, provided as multiple visual features and source URLs, with 
5,018 tags of which 81 have been manually checked and can be considered 
ground-truth tags. Tab. 5.1 reports the classification of these tags according 
to their main WordNet category. In order to obtain all temporal metadata
not contained in the set, we had to download all the original images again
from Flickr. Unfortunately, some images are no longer available, so we
had to use a subset of 238,251 images that are still present on Flickr.
We refer to this subset as NUS-WIDE-240K. Images are unbalanced with
respect to time, with very different numbers of images per date. The time
interval goes from year 1900 (old photo scans) to 2009, with most
of the images concentrated between 2005 and 2008.

MIR-Flickr-1M is also a large dataset crawled from Flickr; it contains
1 million images, selected by their Flickr interestingness score (von
Ahn and Dabbish, 2004)(Huiskes and Lew, 2008). Every image is provided
with full Flickr metadata, including “taken” and “posted” timestamps that
indicate when a photo was taken and when it was shared on Flickr. However,
only 584,892 images (about 58%) provide a valid “taken” timestamp: 330,454
have no timestamp and 84,654 have an invalid one. Like NUS-WIDE-240K,
images are unbalanced with respect to time, and are concentrated around
the years 2007-2009. A ground truth of 18 tags is provided for the first
25,000 images only, which compose a subset called MIR-Flickr25K (Huiskes
and Lew, 2008).


5.2.2 Temporal features 


Given a set of images $I$, all taken in a set of dates $D$ (at daily granularity),
we denote as $T$ the set of all tags used and $U$ the set of all users. For every
image $i \in I$ we denote by $\mathrm{tag}(i) \subseteq T$ the set of associated tags, by
$\mathrm{day}(i) \in D$ the associated timestamp and by $\mathrm{user}(i) \in U$ the user who
owns the image. We also consider two other time spans, a set of weeks $W$
and a set of months $M$, easily computed by integrating over the interval of
days considered. These can be thought of as time series over the selected
index set. For every set considered, we computed a set of features, as
proposed in (Kim et al., 2012):


• Images per day: the number of relevant images which are taken in
a day. More specifically, given a day $d \in D$, the number of images per
day (IMD) is defined as

\[ \mathrm{IMD}(d) := |\{ i \in I \mid \mathrm{day}(i) = d \}| \tag{5.1} \]

Similarly we also define a feature for the number of images per week
(IMW) and per month (IMM).

• Images per day for a tag: the number of relevant images associated
with a tag which are taken in a day. More specifically, given a tag $t \in T$
and a day $d \in D$, the number of images with $t$ per day (ITD) is defined
as

\[ \mathrm{ITD}(t, d) := |\{ i \in I \mid \mathrm{day}(i) = d \wedge t \in \mathrm{tag}(i) \}| \tag{5.2} \]

Similarly we also define a feature per week (ITW) and per month
(ITM).

Object   | 12 || Animal | 13 || Location   | 2 || Substance       | 2
Action   |  5 || Plant  |  4 || Top        | 4 || Time            | 2
Artifact | 26 || Event  |  4 || Phenomenon | 4 || Person + Groups | 3

Table 5.1: WordNet categories of NUS-WIDE ground-truth tags.
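
The two counting features defined above can be sketched in a few lines of Python. This is only a minimal illustration, not the scripts used for the analysis: the record layout (dictionaries with "day" and "tags" keys), the toy records and the function names are assumptions made for the example.

```python
from collections import Counter
from datetime import date


def images_per_day(images):
    """IMD(d): number of images taken on each day d."""
    return Counter(img["day"] for img in images)


def images_per_tag_day(images):
    """ITD(t, d): number of images taken on day d that carry tag t."""
    counts = Counter()
    for img in images:
        for tag in set(img["tags"]):  # a tag counts once per image
            counts[(tag, img["day"])] += 1
    return counts


# Toy records; field names and values are illustrative only.
images = [
    {"day": date(2006, 6, 9), "tags": ["soccer", "grass"]},
    {"day": date(2006, 6, 9), "tags": ["soccer"]},
    {"day": date(2006, 6, 10), "tags": ["snow"]},
]

imd = images_per_day(images)
itd = images_per_tag_day(images)
```

Weekly and monthly variants would follow the same pattern, with the day replaced by a week or month key before counting.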


However, a phenomenon associated with social sources is that of batch
tagging: a user may upload an entire album of photos and, instead
of carefully tagging each photo, simply assign the same tags to all of
them (i.e. tag the album instead of every single photo). This may
result in a kind of noise with respect to the normal use of tags in time,
and the features defined above are sensitive to this kind of noise,
producing noisy peaks over single days. To produce a more meaningful
analysis we decided to collapse all images that are batch tagged into a single
entry. A set of images is considered batch tagged if they are all uploaded
by the same user on the same day and have the same set of tags. More
specifically, given a user $\hat{u} \in U$, a day $d \in D$ and a set of tags
$\hat{T} \subseteq T$, a set of images $I_B = \{i_1, i_2, \ldots, i_n\}$ is considered batch
tagged if $\mathrm{tag}(i) = \hat{T}$, $\mathrm{user}(i) = \hat{u}$ and $\mathrm{day}(i) = d$ for all $i \in I_B$.
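
As a rough sketch of the collapsing step, images can be grouped by the triple (user, day, tag set) and only one representative kept per group. The record layout and the helper name `collapse_batch_tagged` are hypothetical, introduced only for this example; the original processing pipeline is not described at this level of detail.

```python
def collapse_batch_tagged(images):
    """Keep a single image for every (user, day, tag set) group."""
    seen = set()
    kept = []
    for img in images:
        # frozenset makes the tag set hashable and order-independent
        key = (img["user"], img["day"], frozenset(img["tags"]))
        if key in seen:
            continue
        seen.add(key)
        kept.append(img)
    return kept


# Toy records; field names and values are illustrative only.
photos = [
    {"user": "u1", "day": "2007-03-04", "tags": ["beach", "sunset"]},
    {"user": "u1", "day": "2007-03-04", "tags": ["sunset", "beach"]},  # same batch
    {"user": "u1", "day": "2007-03-05", "tags": ["beach", "sunset"]},  # other day
    {"user": "u2", "day": "2007-03-04", "tags": ["beach", "sunset"]},  # other user
]
collapsed = collapse_batch_tagged(photos)
```

In this toy input only the second photo is dropped, since it shares user, day and tag set with the first.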


5.2.3 Flickr Popularity Model 


As described in (Jin et al., 2010), the images available in the two datasets
are only a sample of all images in Flickr. In addition, the number of images
in Flickr varies considerably over time, depending on the popularity of the
site itself. This slow change over time can be modeled as a trend over all tags,
independent of any particular query. Unfortunately, no statistics are
released publicly, and other sources such as Alexa² or Google Trends³ are
affected by the impact of news. Based on this preliminary analysis, and
assuming a uniform sampling in Flickr searches, we use the feature IMD
to remove this background deviation by normalizing the ITD feature.

2 Alexa Internet, Inc. http://www.alexa.com
3 Google Trends. http://www.google.com/trends

Given a tag $t \in T$ and a date $d \in D$ we compute:

\[ \overline{\mathrm{ITD}}(t, d) = \frac{\mathrm{ITD}(t, d)}{\mathrm{IMD}(d)} \tag{5.3} \]

This may also be considered as a frequentist probability distribution of
tag $t$ in day $d$ with respect to all other tags considered, which is $p(t; d)$.
Similarly we also compute $\overline{\mathrm{ITW}}$ and $\overline{\mathrm{ITM}}$ by considering a week and a
month granularity, respectively. After collapsing all batch tagged images,
the two datasets retain 179,128 images for NUS-WIDE-240K and 531,670
images for MIR-Flickr-1M, respectively.
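
The normalization of the per-tag counts by the total number of images per day can be illustrated as follows. This is a sketch under stated assumptions: the dictionaries `itd` and `imd`, their key layout and the counts in them are invented for the example, and days without images are simply skipped.

```python
def normalized_itd(itd, imd):
    """Divide each ITD(t, d) count by IMD(d), skipping days with no images."""
    return {
        (tag, day): count / imd[day]
        for (tag, day), count in itd.items()
        if imd.get(day, 0) > 0
    }


# Toy counts; values are illustrative only.
imd = {"2006-06-09": 4, "2006-06-10": 2}
itd = {
    ("soccer", "2006-06-09"): 2,
    ("snow", "2006-06-09"): 1,
    ("soccer", "2006-06-10"): 1,
}
itd_norm = normalized_itd(itd, imd)
```

Dividing by the daily total removes the site-wide growth in uploads, so that a rising curve reflects a rising share of a tag rather than a rising number of photos.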


5.2.4 Processing 


First of all we present a qualitative analysis by measuring the occurrence
of tags in time. Given that NUS-WIDE-240K has the largest ground truth
of all datasets considered and that we are looking to discover the relations
between tags and image content with respect to time, we chose to use it as
the main reference. We use all the 81 manually checked tags as the set $T$ and
consider four information sources, which differ in the kind of
underlying latent process:

• From NUS-WIDE-240K, for all images, we consider the set $T$ of tags
using the manually validated tags which constitute the entire ground
truth; we refer to this source as NUS-GT.

• From NUS-WIDE-240K, for all images, we consider the set $T$ of tags
using the user tags (i.e. the tags provided by the respective Flickr
users); we refer to this source as NUS-TAGS.

• From MIR-Flickr-1M, for all images, we consider the set $T$ of tags
using the user tags; we refer to this source as MIR-TAGS.

• Besides image datasets, we also consider a source of temporal query
information given by Google Trends. From Google Trends, we have
downloaded all available query data for the set $T$ of tags considered;
we refer to this source as GOO-TAGS.


All sources are to be considered subject to different kinds of noise. In
particular, all images are highly unbalanced over time, resulting in days with
hundreds of images and others with at most ten images. To reduce this
effect, we chose to consider only the largest time span with at least 350
images per week. In addition, the two image datasets differ in the time
interval that has the most images. This forced us to use a reduced time
interval, from 2005-06-01 to 2008-08-01 for NUS-WIDE-240K (retaining
161,176 images out of 179,128) and from 2007-01-01 to 2008-08-01 for
MIR-Flickr-1M (retaining 110,064 images out of 531,670). These filters were
applied with a combination of Python scripts and Google Refine⁴. After this
we used the R environment (Team, 2011) to plot the data and run all
subsequent analyses. Plotting the features of this data revealed that the
noise reduction was insufficient to clearly visualize most characteristic
patterns.

4 Google Refine. http://code.google.com/p/google-refine

To make the time series patterns clearer, we computed a simple moving
average over all time series, varying the window size $n$ from 2 to 10 weeks.
For a daily time series defined over a time span $V$ and a tag $t \in T$, the
smoothed series is defined as:


\[ \mathrm{ITD}_n(t, d) = \frac{1}{2n + 1} \sum_{i=-n}^{n} \mathrm{ITD}(t, d + i) \quad \forall d \in V \tag{5.4} \]


This has the effect of smoothing the series, making the trend easier to
visualize. On the other hand, tags with a very sparse frequency tend to be
degraded, so we adjusted the window size empirically, based on visual
clarity. The final time series are composed of 1,158 and 579 daily samples
for NUS-WIDE-240K and MIR-Flickr-1M, respectively.
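
A simple moving average of this kind could be implemented as sketched below. The sketch assumes a centered window of half-width `n` (that is, 2n + 1 samples) that is truncated at the borders of the series; both choices are assumptions of this example rather than details confirmed by the text, and the function name is hypothetical.

```python
def moving_average(series, n):
    """Centered simple moving average with half-width n.

    Near the borders the window is truncated to the available samples
    (an assumption of this sketch).
    """
    smoothed = []
    for d in range(len(series)):
        lo = max(0, d - n)
        hi = min(len(series), d + n + 1)
        window = series[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# A single-day burst is spread over the neighbouring days.
burst = [0, 0, 3, 0, 0]
smoothed_burst = moving_average(burst, 1)
```

The example shows the trade-off mentioned above: an isolated peak is flattened, which helps dense series but can blur tags that occur only sparsely.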


5.2.5 Correlation analysis 


To exploit the underlying time process and to be able to improve image
annotation using temporal information, we need a way to quantitatively
evaluate the possible correlation between sources. This allows us to analyze
whether a series can be estimated from another one and how a generalized
model may describe the original time series. To this end we compute a
correlation measure over two series. First of all we standardize all time
series: given a time series $X = (x_i : i \in D)$, we compute
$\hat{x}_i = \frac{x_i - \bar{X}}{s}$, where $\bar{X}$ is the sample mean and $s$ is the sample
standard deviation. Although the sample mean and the sample standard
deviation are sensitive to outliers, those are removed by the filtering and
smoothing procedure described above. To evaluate the correlation between
two time series, we chose to use the sample Pearson correlation coefficient,
often denoted as $r$. Given two time series $X$ and $Y$ of $n$ samples, $r$ is
defined as the ratio between their covariance and the product of their
standard deviations:

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{Y})^2}} \tag{5.5} \]

which takes values in $[-1, 1]$. Values towards the positive or negative end
reveal a strong correlation between the two time series, differing only in
the sign. We can reformulate it as the mean of the products of the standard
scores, which permits us to use the standardized time series
$\hat{x}_i = \frac{x_i - \bar{X}}{s_X}$ and $\hat{y}_i = \frac{y_i - \bar{Y}}{s_Y}$:

\[ r = \frac{1}{n - 1} \sum_{i=1}^{n} \hat{x}_i \hat{y}_i \tag{5.6} \]

Given that the strength of correlation does not depend on the direction or
the sign, we also computed r-square. Unfortunately the interpretation of
a correlation coefficient depends heavily on the context and purposes, which
cannot be easily defined at this stage of the work. However, several works
such as (Cohen, 1988) offer guidelines that can be used to interpret our
analysis; they are reported in Tab. 5.2.


Correlation | None         | Small        | Medium       | Strong
Positive    | 0.0 to 0.09  | 0.1 to 0.3   | 0.3 to 0.5   | 0.5 to 1.0
Negative    | -0.09 to 0.0 | -0.3 to -0.1 | -0.5 to -0.3 | -1.0 to -0.5

Table 5.2: Guidelines for sample Pearson correlation coefficient.
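
The correlation measure and the guideline table can be tied together in a short sketch. It computes the sample Pearson coefficient from standardized series (sample standard deviation, hence the n - 1 divisor) and maps it to a guideline label. The function names are hypothetical, and because adjacent ranges in the table share their boundary values, assigning a boundary value to the stronger class is a choice made for this example.

```python
import math


def standardize(xs):
    """Standard scores using the sample mean and sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return [(x - mean) / s for x in xs]  # undefined for constant series (s == 0)


def pearson_r(xs, ys):
    """Sample Pearson r as the (n - 1)-normalized sum of score products."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


def strength(r):
    """Guideline label for r; boundary values go to the stronger class."""
    a = abs(r)
    if a < 0.1:
        return "None"
    if a < 0.3:
        return "Small"
    if a < 0.5:
        return "Medium"
    return "Strong"


r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly aligned series
r_down = pearson_r([1, 2, 3], [3, 2, 1])        # perfectly opposed series
```

Under these guidelines the value r = 0.46 reported for “soccer” falls in the “Medium” class, and r-square is obtained simply as `r ** 2`.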


5.3 Experiments and Discussion 


In the following we consider both the tags that have been
added by the users who uploaded the images to Flickr (referring to them as
“user tags”) and the tags that have been manually checked by the creators
of NUS-WIDE as referring to the visual content of images (referring to them
as “ground-truth” tags). In fact, several studies have shown that tags are often
ambiguous and personalized (Kennedy et al., 2006) (Sigurbjörnsson and van
Zwol, 2008), and do not necessarily reflect the visual content of the image.
As an example consider Fig. 5.2 and 5.3, showing the temporal usage of the
tags “soccer” and “snow” in NUS-WIDE, along with the respective Google
searches, as obtained from Google Trends. It can be observed that the peak
in usage of the “soccer” tag, associated with the 2006 FIFA World Cup,
mirrors that in Google Trends, but the peak is much less pronounced in the
ground-truth tags; this indicates that for this tag the relationship between
tag and image may arise from how people react to social events,
rather than from photos that actually depict the event. On the other
hand the peaks of both the user and ground-truth “snow” tags correspond
to those of Google Trends: in this case the relationship may exist because
people are more likely to take pictures of snow scenes during winter, and
this concept is less related to social aspects than to the visual content of
these images.


5.3.1 Temporal Evaluation 


Considering time series composed of the frequencies of image tags (either 
user or ground-truth) and Google searches obtained from Google Trends, it 
is possible to observe that they exhibit the presence of different components, 
that may appear mixed together: 


trend: long-term variation, which can be increasing, decreasing or also
stable (see Fig. 5.4). Terms such as “computer” or “military” have this
pattern;

cyclical variation: repeated but not periodic variations. Tags like “sports”
or “flags” have this pattern;

seasonal variation: periodic variations, e.g. due to concepts associated
with some regular event (see Fig. 5.4). Concepts related to seasons
show this behavior, like “garden”, “snow”, “beach” or “frost”;

irregular variation: random irregular variations, e.g. due to the sudden
emergence of a topic (see Fig. 5.5), which appears as a burst of activity.
Concepts that exhibit this pattern are related to social or natural
events like “soccer”, “earthquake” and “protest”.

Figure 5.2: Frequency of “soccer” in NUS-GT, NUS-TAGS and GOO-TAGS: the peaks
of Google Trends and user tags in the summer of 2006 are related to the World Soccer
Championship.

Figure 5.3: Frequency of “snow” in NUS-GT, NUS-TAGS and GOO-TAGS: the peaks
are associated with winter seasons. Tag frequencies have been normalized by the number
of images of the same day.


5.3.2 Correlation Analysis 


Fig. 5.6 reports the outcome of the correlation analysis of NUS-TAGS with
NUS-GT, NUS-TAGS with GOO-TAGS and NUS-GT with MIR-TAGS.
In particular it can be observed that the correlation of NUS-TAGS and
NUS-GT yields mostly “Medium” and “Strong” values, while the
correlation between user tags and Google searches is overall weaker and
can be useful only for a selected number of tags. The correlation between
NUS-GT and MIR-TAGS has a large number of “Medium” and “Strong” values,
suggesting that the temporal information of NUS-WIDE can be used in
MIR-Flickr-1M.

Figure 5.4: Time series patterns of NUS-TAGS and GOO-TAGS, averaged over 10 weeks.
i) trend (computer); ii) seasonal (garden).

[Figure 5.5 plot: “earthquake” (r = 0.75); legend includes GoogleTrends]

Figure 5.5: Time series patterns of NUS-TAGS and GOO-TAGS, averaged over 10 weeks.
Episodic behavior (earthquake: peaks correspond to earthquakes in China and Pakistan).

Correlation analysis of NUS-TAGS with GOO-TAGS, followed by averaging
of r-square values over tag classes (Fig. 5.7 left), shows that Plant,
Event, Phenomenon and Action obtain the highest values. A second group
of categories comprises Artifact, Person+Group, Animal, Object and Time.
In general, the categories that obtain the best performance benefit
from tags whose time series show seasonal behaviors (e.g. “snow”, “frost”,
“grass”, “leaf”) or have a “burst” behavior associated with specific social
events (e.g. “soccer”, “protest”, “earthquake”).

Correlation analysis of NUS-GT with GOO-TAGS (Fig. 5.7 right) shows 
that Plant and Phenomenon categories maintain their position among the 
best performing classes, because of the tags that exhibit a seasonal pattern. 
Instead the correlation of Event and Action categories is lower because the 
ground-truth tags that have an episodic pattern like “soccer”, “protest” and 
“earthquake” have a lower correlation. This is due to the fact that these 
tags are employed by users also when the content of the image is not visually 
related to the described event. 
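
The per-category averaging of r-square values discussed above can be sketched as follows. The tag names, category labels and correlation values in the toy input are placeholders invented for this example, not results from the analysis, and the function name is hypothetical.

```python
from collections import defaultdict


def mean_r_squared_by_category(r_by_tag, category_of):
    """Average r^2 over the tags belonging to each category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tag, r in r_by_tag.items():
        category = category_of[tag]
        sums[category] += r * r  # squaring discards the sign of r
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


# Placeholder inputs; not values from the datasets.
r_by_tag = {"tag_a": 0.5, "tag_b": -0.5, "tag_c": 1.0}
category_of = {"tag_a": "CatX", "tag_b": "CatX", "tag_c": "CatY"}
averages = mean_r_squared_by_category(r_by_tag, category_of)
```

Because the values are squared before averaging, a positive and a negative correlation of the same magnitude contribute equally to the category score.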


5.4 Conclusions 
This chapter presented a thorough analysis of the temporal aspects of user
annotations in two popular large-scale datasets. The correlation of the time
series of the tags with Google searches showed that for certain concepts web
information sources may be beneficial to annotate social media.

Figure 5.6: i) r values computed between NUS-TAGS and NUS-GT; ii) r values computed
between NUS-TAGS and GOO-TAGS; iii) r values computed between NUS-GT and MIR-TAGS.

Figure 5.7: NUS-WIDE dataset: r-square averages for tag classes. i) NUS-TAGS
correlation with GOO-TAGS; ii) NUS-GT correlation with GOO-TAGS.

Chapter 6
Multimodal Feature Learning for Sentiment Analysis 


In this chapter we investigate the use of a multimodal feature
learning approach, using neural network based models such as
Skip-gram and Denoising Autoencoders, to address sentiment analysis
of micro-blogging content, such as Twitter short messages,
that are composed of a short text and, possibly, an image. Motivated
by the recent advances in unsupervised learning of language
models and visual features based on neural network models, we
propose a novel architecture that incorporates these models and
test it on several standard Twitter datasets. We show that the
approach is efficient and obtains good classification results.¹


6.1 Introduction 


In the last few years micro-blogging services, in which users describe their
current status by means of short messages, have achieved great success among
users. Undoubtedly, one of the most successful services is Twitter², which is
used worldwide to discuss daily activities, to report or comment on news,
and to share information using messages (called ‘tweets’) of at
most 140 characters. Since 2011 Twitter has natively supported adding images
to tweets, easing the creation of richer content. A study performed by Twitter³
has shown that adding images to tweets increases user engagement more
than adding videos or hashtags.

Despite their brevity these messages often also convey the feelings and the
point of view of the people writing them. The addition of images reinforces
and clarifies these feelings (see Fig. 6.1). Automatic analysis of the sentiment
of these tweets, i.e. retrieving the opinion they express, has received a large
amount of attention from the scientific community. This is due to its
usefulness in analyzing a large range of domains such as politics (Tumasjan
et al., 2010) and business (Ghiassi et al., 2013). Sentiment analysis may
encompass different scopes (Bravo-Marquez et al., 2013): i) polarity, i.e.
categorizing a sentiment as positive, negative or neutral; ii) emotion, i.e.
assigning a sentiment to an emotional category such as joy or sadness;
iii) strength, i.e. determining the intensity of the sentiment.

1 Parts of the work presented in this chapter have been published in Baecchi, C., Uricchio, T.,
Bertini, M., and Del Bimbo, A. (2016). "A multimodal feature learning approach for sentiment
analysis of social network multimedia". Multimedia Tools and Applications, 75(5), 2507-2525.
The publication is available at http://dx.doi.org/10.1007/s11042-015-2646-x.

2 Twitter reports to have 271 million monthly active users that send 500 million status updates
per day - https://about.twitter.com/company

3 https://blog.twitter.com/2014/what-fuels-a-tweets-engagement


So far, the vast majority of works have addressed only the textual data.
In this chapter we address the classification of tweets according to their
polarity, considering both textual and visual information. We propose a novel
scheme that, by incorporating a language model based on neural networks,
can efficiently exploit web-scale source corpora and robust visual features
obtained from unsupervised learning. The proposed method has been tested
on several standard datasets, showing promising results.


[Figure 6.1 images: two tweets with attached photos. Left: “Holding his bottle already #king #cairo”. Right: “Hey #tcot #Inyhbt, Remember this? Thank you George Bush. - via @CoronaRay”]

Figure 6.1: Examples of tweets with images from the SentiBank Twitter dataset (Borth
et al., 2013). left) positive sentiment tweet; right) negative sentiment tweet.


The chapter is organized as follows: Sect. 6.2 provides an overview of
previous works; the proposed method is presented in Sect. 6.3, while
experiments on four standard datasets and comparison with state-of-the-art
approaches and baselines are reported in Sect. 6.4. Conclusions are drawn
in Sect. 6.5.

6.2 Previous Work


Sentiment analysis in texts. Brevity, sentence composition and variety 
of topics are among the main challenges in sentiment analysis of tweets 
(and micro-blogs in general). In fact these texts are short, are often
not composed as carefully as news or product reviews, and cover almost any
conceivable topic. Several specific approaches for Twitter sentiment analysis
have been proposed, typically using sentence-level classification with
n-gram word models. Liu et al. (Liu et al., 2012) concatenate tweets of the
same class (polarity) in large documents, from which a language model is 
derived and then classify tweets through maximum likelihood estimation, 
using both supervised and unsupervised data for training; the role of unsu- 
pervised data is to deal with words that do not appear in the vocabulary 
that can be built from a small supervised dataset. In (Bifet and Frank, 
2010) three approaches to sentiment classification are compared: Multi- 
nomial Naive Bayes (MNB), Hinge Loss with Stochastic Gradient Descent 
and Hoeffding Tree; the authors report that MNB outperforms the other 
approaches. In (Deitrick and Hu, 2013) unigram and bigram features have 
been used to train Naive Bayes classifiers, where bigrams help to account 
for negation of words. Saif et al. (Saif et al., 2013) have evaluated the use 
of a Max Entropy classifier on several Twitter sentiment analysis datasets. 
Since using n-grams on tweet data may reduce classification performance 
due to the large number of infrequent terms in tweets, some authors have 
proposed to enrich the representation using micro-blogging features such as 
hashtags and emoticons as in (Barbosa and Feng, 2010), or using semantic 
features as in (Saif et al., 2012). 


Neural networks language models. Recently, the scientific community 
has addressed the problem of learning vector representations of words that 
can represent information like similarity or other semantic and syntactic 
relations, obtaining better results than using the best n-gram models. The 
use of neural networks to perform this task is motivated by recent works 
addressing the scalability of training. In this formulation every word is rep- 
resented in a distributional space where operations like concatenation and 
averaging are used to predict other words in context, trained by the use of 
stochastic gradient descent and backpropagation. In the work of (Bengio 
et al., 2006), a model is trained based on the concatenation of several words 
to predict the next word: every word is mapped into a vector space where 
similar words have similar vector representations. A subsequent work uses
multitask techniques (Collobert and Weston, 2008) to jointly train several
tasks, showing improvements in generalization. A fast hierarchical language
model was proposed in (Mnih and Hinton, 2009), attacking the main drawback
of needing long training and testing times. The use of unsupervised
additional words was proposed by (Turian et al., 2010), showing further
improvements using word features learned in advance of a supervised NLP
task. Recently Mikolov et al. (Mikolov, Sutskever, Chen, Corrado and Dean, 
2013) have proposed several improvements on Hierarchical Softmax (Mnih 
and Hinton, 2009) and Negative Sampling (Gutmann and Hyvarinen, 2012) 
and introduced the Skip-gram model (Mikolov, Yih and Zweig, 2013), re- 
ducing further the computational cost, and showing fast training on corpora 
of billions of words (Mikolov, Sutskever, Chen, Corrado and Dean, 2013). 
More recently, researchers also extended these models, trying to achieve 
paragraph and document level representations (Le and Mikolov, 2014). 


Micro-blog multimedia analysis. Most of the works dealing with analysis 
of the multimedia content of micro-blogs have dealt with content summa- 
rization and mining, image classification and annotation. Geo-tagged tweet 
photos are used in (Yanai, 2012; Kaneko et al., 2013) to visually mine events 
using both textual and visual information. The system presented in (Serra 
et al., 2013) provides tools for content curation, creation of personalized 
web sites and magazines through topic detection of tweets and selection of 
representative associated multimedia. A system for exploration of events 
based on facets related to who, when, what, why and how of an event, has 
been presented in (Wang, Cui, Xie, Chen, Zhu and Yang, 2012), using a 
Bilateral Correspondence model (BC-LDA) for image and words. A multi- 
modal extension of LDA has been proposed in (Bian et al., 2013) to discover 
sub-topics in microblogs, in order to create a comprehensive summarization. 

An algorithm for photo tag suggestion using Twitter and Wikipedia is
proposed in (McParlane and Jose, 2014) to annotate social media related to
events, exploiting the fact that tweets about an event are typically posted
while it unfolds. Classification of tweets’ images into visually-relevant
and visually-irrelevant, i.e. images that are correlated or not with the text of
the tweet, has been studied in (Chen et al., 2013), using a combination of
text, context and visual features.

Zhao et al. (Zhao et al., 2012) have studied the effects of adding multi- 
media to tweets within Sina Weibo, a Chinese equivalent of Twitter, finding 
that adding images boosts the popularity of tweets and authors, and extends 
the lifespan of tweets. 


Sentiment analysis in social images. Sentiment analysis of visual data
has so far received less attention than that of text data and, in fact, only
a few small datasets exist, such as the International Affective Picture System
(IAPS) (Lang et al., 1999) and the Geneva Affective Picture Database
(GAPED) (Dan-Glauser and Scherer, 2011). The former provides ratings of
emotion (in terms of pleasure, arousal and dominance) for 369 images, while
the latter provides 520 images associated with negative sentiment, 89 neutral
and 121 positive images. Another related direction is given by works on
aesthetics: surveys are provided in (Wang and He, 2008; Joshi et al., 2011).
However, none of these datasets deals with social media.

A few works have addressed the problem of multimedia sentiment anal- 
ysis of social network data. Borth et al. (Borth et al., 2013) have recently 
presented a large-scale visual sentiment ontology and associated set of de- 
tectors, consisting of 3,244 pairs of nouns and adjectives (ANP), based on 
Plutchik's Wheel of Emotions (Plutchik, 2001). Detectors are trained using 
Flickr images, represented using a combination of global (e.g. color
histogram and GIST) and local (e.g. LBP and BoW) features. The paper
also provides two publicly available image datasets obtained from Flickr
and from Twitter. The system proposed in (Cao et al., 2014) for the
classification of Sina Weibo statuses exploits the ANP detectors proposed in
(Borth et al., 2013), fusing them with text sentiment analysis based on 3 
features: i) sentiment words from Hownet (Chinese equivalent to Word- 
Net), ii) semantic tags and iii) rules of sentence construction, to cope with 
rhetorical questions, negations and exclamatory sentences. 

A cross-media bag-of-words model, combining a bag of text words with a bag
of image words obtained from the SentiBank detectors of (Borth et al., 2013),
has been proposed in (Wang, Cao, Li, Li and Ji, 2014) for sentiment analysis
of microblog messages obtained from Sina Weibo. Yang et al. (Yang, Cui,
Zhu, Zhao, Shi and Yang, 2014) have proposed a hybrid link graph for
images of social events, weighting links based on textual emotion information,
visual similarity and social similarity. A ranking algorithm to discover
emotionally representative images in microblog statuses is then presented. The
work of Chen et al. (Chen, Chen, Hsu, Liao and Chang, 2014) distinguishes
between the intended publisher effect and the sentiment that is induced in
the viewer (‘viewer affect concept’) and aims at predicting the latter. The
goals are to recommend appropriate images and suggest image comments.


6.3 The Proposed Method 


Recent works (Mikolov et al., 2011) have shown that neural network based
language models significantly outperform N-gram models; similarly, neural
networks that learn visual features and classify images have achieved
state-of-the-art results on several standard datasets and international
competitions (Krizhevsky et al., 2012). The proposed method builds on
these advances.

We start by describing the well-known text-based Continuous
Bag-Of-Words (CBOW) model (Mikolov, Yih and Zweig, 2013), which is the
basis of our scheme; then we present our model for the polarity classification
problem. Finally, we show a further extension of the model that incorporates
visual information, based on a Denoising Autoencoder (Vincent et al., 2008),
which allows the same unsupervised capabilities on images as CBOW-based
methods on text.

Figure 6.2: Visualization of CBOW word vectors trained on tweets of the SemEval-2013
dataset. Blue points are single words classified as negative, while red ones are positive.
Semantically similar words are near each other (e.g. ‘crashing’ and ‘crashed’, ‘better’
and ‘best’) and share the same polarity.


6.3.1 Textual information 


Mikolov et al. (Mikolov, Yih and Zweig, 2013) showed that in the CBOW
model, words with similar meaning are mapped to similar positions in a
vector space. Thus, distances may carry a meaning, making it possible to
formulate questions in the vector space using simple algebra (e.g. the result of
vector(‘king’) - vector(‘man’) + vector(‘woman’) is near vector(‘queen’)).
Another property is the very fast training, which makes it possible to exploit
large-scale unsupervised corpora such as web sources (e.g. Wikipedia).
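
The vector arithmetic mentioned above can be illustrated with a minimal sketch. The three-dimensional vectors below are hand-made for illustration only (real CBOW vectors are learned and have many more dimensions), and the function names `cosine` and `analogy` are hypothetical.

```python
import math

# Hypothetical toy vectors, chosen by hand so that the analogy holds;
# they are not taken from any trained CBOW model.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c),
    excluding the three query words themselves."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman", VECTORS))
```

Cosine similarity is used here as the notion of "near"; this is a common choice for word vectors, but it is an assumption of the sketch rather than something stated in the text.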


Continuous Bag-Of-Words model. In this framework, each word is mapped
to a unique vector of length $Q$, represented by a column in a word matrix $W$.
Every column is indexed by the corresponding index from a dictionary $V_T$.
Given a sequence of words $w_1, w_2, \ldots, w_K$, the CBOW model with hierarchical
softmax aims at maximizing the average log probability of predicting the
central word $w_t$ given the context represented by its M-window of words,
i.e. the $M$ words before and after $w_t$:

$$ \frac{1}{K} \sum_{t=M}^{K-M} \log p(w_t \mid w_{t-M}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+M}) \tag{6.1} $$

The output $f \in \mathbb{R}^{|V_T|}$ of the model is defined as:

$$ f = [W_{t-M}, \ldots, W_{t-1}, W_{t+1}, \ldots, W_{t+M}]^T G \tag{6.2} $$

where $W_i$ is the column of $W$ corresponding to the word $w_i$ and
$G \in \mathbb{R}^{Q \times |V_T|}$. Both $W$ and $G$ are considered as weights and
have to be trained, resulting in a dual representation of words. Typically the
columns of $W$ are taken as final word features. An output probability is then
obtained by applying the softmax function to the output of the model:

$$ p(w_t \mid w_{context}) = \frac{e^{f_{w_t}}}{\sum_{i} e^{f_i}} \tag{6.3} $$

where $w_{context} = (w_{t-M}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+M})$. When
considering a high number of labels, the softmax can be computed more efficiently
by employing a hierarchical variation (Mnih and Hinton, 2009), which requires
evaluating $\log_2(|V_T|)$ words instead of $|V_T|$.
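
As a minimal sketch of Eqs. (6.2) and (6.3), the forward pass can be written in a few lines. The sketch assumes that the bracket in Eq. (6.2) combines the context word vectors by summation (as is done for windows in Eq. (6.5)); matrices are stored as lists of columns, and all function names are hypothetical.

```python
import math

def context_vector(W, context_ids):
    """Sum the columns of W selected by the context word indices.
    W is a list of columns: W[i] is the Q-dimensional vector of word i.
    Summation is an assumption of this sketch."""
    Q = len(W[0])
    return [sum(W[i][q] for i in context_ids) for q in range(Q)]

def scores(W, G, context_ids):
    """Eq. (6.2): one score per dictionary word, i.e. the dot product
    between the context vector and the corresponding column of G."""
    c = context_vector(W, context_ids)
    return [sum(cq * gq for cq, gq in zip(c, g_col)) for g_col in G]

def softmax(f):
    """Eq. (6.3), with the usual max subtraction for numerical stability."""
    m = max(f)
    exps = [math.exp(v - m) for v in f]
    total = sum(exps)
    return [e / total for e in exps]

# Toy dictionary of 4 words with Q = 2 (made-up values).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
G = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
p = softmax(scores(W, G, context_ids=[0, 1]))
print(p.index(max(p)))  # index of the most probable central word
```

The full softmax above evaluates all $|V_T|$ scores; the hierarchical variation mentioned in the text avoids this, and is not reproduced here.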

In (Mikolov, Yih and Zweig, 2013), an additional task named Negative
Sampling is considered, where a word $w$ is to be classified as related to the
given context or not, i.e. $p(w \mid w_{context})$:

$$ u_w = \sigma([W_{t-M}, \ldots, W_{t-1}, W_{t+1}, \ldots, W_{t+M}]^T N_w) \tag{6.4} $$

where $N_w \in \mathbb{R}^Q$ and $\sigma$ is the logistic function. Depending on
whether $w$ is the actual word $w_t$ or a randomly sampled one, $u_w$ has a target
value of 1 or 0, respectively.


The CBOW-LR method. Our model, denoted as CBOW-LR, is an extension
of CBOW with negative sampling, specialized for the task of sentiment
classification. An important difference from approaches that directly use a
CBOW representation, or from (Turian et al., 2010), is that our model learns
representation and classification concurrently. Considering that multi-task
learning can improve neural network performance (Turian et al., 2010),
the idea is to use two different contributions accounting for semantic and
sentiment polarity, respectively.

Given a corpus of tweets X where each tweet is a sequence of words
$w_1, w_2, \ldots, w_K$, we aim at classifying tweets as positive or negative, and
at learning word vectors $W \in \mathbb{R}^{Q \times |V_T|}$ with properties related
to the sentiment carried by words, while retaining the semantic representation.
Semantics is well captured by a CBOW model, while sentiment polarity has
limited presence or is lacking. Note that polarity supervision is limited and
possibly weak, thus a robust semi-supervised setting is preferred: on the one
hand, a model of sentiment polarity can use the limited supervision available;
on the other hand, the ability to exploit a large corpus of unsupervised text,
as CBOW does, can help the model to classify previously unseen text. This is
explicitly accounted for in our model by considering two different components:

i) inspired by (Mikolov, Yih and Zweig, 2013), we consider a feature
learning task on words by classifying the sentiment polarity of a tweet. A tweet
is represented as a set of M-windows of words, each denoted as $G$. Each
window $G$ is represented as the sum of its associated word vectors $W_i$, and
a polarity classifier based on logistic regression is applied accordingly:

$$ y(G) = \sigma\Big(C^T \Big(\sum_{W_i \leftarrow w_i \in G} W_i\Big) + b_s\Big) \tag{6.5} $$

Here the notation $W_i \leftarrow w_i \in G$ refers to selecting the i-th column of
$W$ by matching the word $w_i$ from $G$. The weight vector $C \in \mathbb{R}^Q$ and
the bias $b_s \in \mathbb{R}$ are the parameters of a logistic regression, while a
binary cross-entropy is applied as loss function for every window $G$. This is
applied for every tweet $T$ labeled with $s_T$ in the training set and results in
the following cost:

$$ C_{\text{sent}} = \sum_{(T, s_T)} \sum_{G \in T} \big( -s_T \log(y(G)) - (1 - s_T) \log(1 - y(G)) \big) \tag{6.6} $$

However, differently from a standard logistic regression, the representation
matrix $W$ is also a parameter to be learned. A labeled sentiment
dataset is required to learn this task.

ii) we explicitly represent semantics by adding a task similar to negative
sampling, without considering the hierarchical variation. The idea is that
a CBOW model may also act as a regularizer and provide additional
semantic knowledge of word context. Given a window $G$, a classifier has to
predict whether a word $w$ fits in it. To this end, an additional cost is added:

$$ C_{\text{sem}} = \sum_{T} \sum_{G \in T} \sum_{(r_i, w_i) \in \mathcal{F}} \big( -r_i \log(u_{w_i}) - (1 - r_i) \log(1 - u_{w_i}) \big) \tag{6.7} $$

where $\mathcal{F}$ is a set of words $w_i$ with their associated targets $r_i$,
derived from a training text sequence. This is the core of negative sampling:
$\mathcal{F}$ always contains the correct word $w_i$ for the considered context $G$
($r_i = 1$) and $K - 1$ randomly sampled words from $V_T$ ($r_i = 0$). It is indeed
a sampling, as $K < |V_T| - 1$, the number of remaining wrong words. Note that,
differently from the previous task, this one is unsupervised and does not require
labeled data; moreover, tweets can belong to a different corpus than the one used
in the previous component. This allows learning on additional unlabeled corpora,
to enhance word knowledge beyond that of labeled training words.
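
The negative sampling step of component ii) can be sketched as follows. This is a minimal illustration under stated assumptions: the negatives are drawn uniformly and without replacement (the text only says they are randomly sampled), the context is given as an already combined vector, and the names `sample_targets` and `semantic_cost` are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_targets(correct_id, vocab_size, K, rng):
    """Build the set of (target, word) pairs for one window: the correct
    word with target 1 plus K - 1 randomly drawn wrong words with target 0.
    Uniform sampling without replacement is an assumption of this sketch."""
    wrong = [i for i in range(vocab_size) if i != correct_id]
    negatives = rng.sample(wrong, K - 1)
    return [(1, correct_id)] + [(0, i) for i in negatives]

def semantic_cost(context, N, targets):
    """Binary cross-entropy between each target r_i and the score
    u_{w_i} = sigmoid(context . N[w_i]), summed over the sampled words."""
    cost = 0.0
    for r, i in targets:
        u = sigmoid(sum(c * n for c, n in zip(context, N[i])))
        cost += -r * math.log(u) - (1 - r) * math.log(1.0 - u)
    return cost

rng = random.Random(0)
targets = sample_targets(correct_id=2, vocab_size=10, K=5, rng=rng)
N = [[0.3, -0.2] for _ in range(10)]   # made-up output vectors, Q = 2
print(semantic_cost([0.0, 0.0], N, targets))
```

Only K words are scored per window instead of the whole dictionary, which is what makes this task cheap enough to run on large unlabeled corpora.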

Finally, concurrent learning is obtained by forming a total cost, defined
as the sum of the two parts, suitably weighted by $\lambda \in [0, 1]$, and
minimized with SGD:

$$ C_{\text{CBOW-LR}} = \lambda \cdot C_{\text{sent}} + (1 - \lambda) \cdot C_{\text{sem}} \tag{6.8} $$


Fig. 6.2 visualizes the word vectors learned by our model. Note the 
tendency of separating the opposite polarities and the fact that similar words 
are close to each other. 

At prediction time, for each word in a tweet $T$ we consider its M-window
$G$ and we compute Eq. (6.5) for each window, summing the results:

$$ Pred(T) = \sum_{G \in T} (y(G) - 0.5) \tag{6.9} $$

If $Pred(T) < 0$ the tweet is labeled as negative, otherwise it is considered
positive. It is worth noticing that at prediction time the method does not
consider a word as positive or negative on its own, but also uses its context
to classify its sentiment and how strong it is. Thus the same word can be
classified differently if used in different contexts.
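
The prediction rule of Eqs. (6.5) and (6.9) can be sketched as below. The sketch assumes that a window contains the central word together with up to M neighbours on each side and that windows are truncated at the tweet boundaries; neither detail is specified in the text. `W` is stored as a list of word vectors, and all names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def windows(word_ids, M):
    """One window per word: the word plus up to M neighbours on each side.
    Truncation at the tweet boundaries is an assumption of this sketch."""
    n = len(word_ids)
    return [word_ids[max(0, t - M): min(n, t + M + 1)] for t in range(n)]

def window_polarity(W, C, b_s, window):
    """Eq. (6.5): logistic regression on the sum of the window's word vectors."""
    Q = len(C)
    summed = [sum(W[i][q] for i in window) for q in range(Q)]
    return sigmoid(sum(c * s for c, s in zip(C, summed)) + b_s)

def predict(W, C, b_s, word_ids, M):
    """Eq. (6.9): sum of (y(G) - 0.5) over all windows of the tweet.
    A negative score yields the label 'negative', otherwise 'positive'."""
    score = sum(window_polarity(W, C, b_s, G) - 0.5 for G in windows(word_ids, M))
    return score, ("negative" if score < 0 else "positive")

# Made-up parameters with Q = 1: word 0 is "positive", word 1 "negative".
W = [[1.0], [-1.0], [0.0]]
C, b_s = [2.0], 0.0
print(predict(W, C, b_s, [0, 0, 2], M=1))
print(predict(W, C, b_s, [1, 1], M=1))
```

Because each window score depends on the neighbouring words, the neutral word 2 in the first tweet contributes a positive score through its context, which mirrors the remark above about context-dependent polarity.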


6.3.2 Textual and Visual Information 


The CBOW-LR model presented in Sect. 6.3.1 can be extended to account
for visual information, such as that of images associated with tweets or
status messages. Popular image representations are the Visual Bag-Of-Words
Model (Grauman and Darrell, 2005; Lazebnik et al., 2006; Li, Mei, Kweon
and Hua, 2011), the Fisher Vector (Perronnin, Liu, Sánchez and Poirier, 2010)
and its improved version (Perronnin, Sánchez and Mensink, 2010; Baecchi
et al., 2014). However, as shown recently in (Chatfield et al., 2014;
Krizhevsky et al., 2012), neural network based models widely outperform
these previous models. So, to fit with the CBOW representation discussed
in the previous section, we choose to exploit the images by using a
representation similar to the one used for the textual information, i.e. a
representation obtained from the whole training set by means of a neural
network. Moreover, as for the text, unsupervised learning can be performed.
For these reasons, inspired also by works such as (Vincent et al., 2008), we
choose to extend our network with a single-layer Denoising Autoencoder, and
to take its middle-level representation as our image descriptor. As for the
textual version, the inclusion of this additional task allows our method to
concurrently learn a textual representation and a classifier on text polarity
and its associated image.


Denoising Autoencoder. In general, an Autoencoder (also called
Autoassociator (Bengio, 2009)) is a kind of neural network trained to encode the
input into some representation (usually of lower dimension) so that the
input can be reconstructed from that representation. For this type of network
the target output is thus the input itself. Specifically, an Autoencoder is a
network that takes as input a K-dimensional vector $x$ and maps it to a hidden
representation $h$ through the mapping:

$$ h = \sigma(P_e x + b_e) \tag{6.10} $$

where $\sigma$ is the sigmoid function (but any other non-linear activation
function can be used), and $P_e$ and $b_e$ are respectively a matrix of encoding
weights and a vector of encoding biases. At this point, $h$ is the coded
representation of the input, and has to be mapped back to $x$. This second
part produces the reconstruction $z$ of $x$ ($z$ having the same dimension and
domain as $x$). In this step a transformation similar to Eq. (6.10) is used:

$$ z = \sigma(P_d h + b_d) \tag{6.11} $$

where $P_d$ and $b_d$ are respectively a matrix of decoding weights and a vector
of decoding biases. One common choice is to constrain $P_d = P_e^T$; in this
configuration the Autoencoder is said to have ‘tied weights’. The motivation
for this is that tied weights act as a regularizer, preventing the Autoencoder
from learning the identity mapping when the dimension of the hidden layer
is big enough to memorize the whole input; another important advantage
is that the network has to learn fewer parameters. With this configuration,
Eq. (6.11) becomes:

$$ z = \sigma(P_e^T h + b_d) \tag{6.12} $$

Learning is performed by minimizing the cross-entropy between the input
$x$ and the reconstructed input $z$:

$$ L(x, z) = -\sum_{k=1}^{K} \big( x_k \log z_k + (1 - x_k) \log(1 - z_k) \big) \tag{6.13} $$

using stochastic gradient descent and backpropagation.
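
A minimal sketch of the tied-weight Autoencoder of Eqs. (6.10), (6.12) and (6.13) is given below, covering the forward pass and the loss only (no training loop). The small constant `eps` inside the logarithms is an addition of this sketch for numerical safety, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def encode(P_e, b_e, x):
    """Eq. (6.10): h = sigmoid(P_e x + b_e).
    P_e is a list of H rows, each of length K (the input dimension)."""
    return [sigmoid(sum(p * xi for p, xi in zip(row, x)) + b)
            for row, b in zip(P_e, b_e)]

def decode_tied(P_e, b_d, h):
    """Eq. (6.12): z = sigmoid(P_e^T h + b_d), reusing the encoder weights."""
    K = len(b_d)
    return [sigmoid(sum(P_e[j][k] * h[j] for j in range(len(h))) + b_d[k])
            for k in range(K)]

def reconstruction_loss(x, z, eps=1e-12):
    """Eq. (6.13): cross-entropy between input x and reconstruction z,
    both with components in [0, 1]. eps avoids log(0) and is not part
    of the original formula."""
    return -sum(xk * math.log(zk + eps) + (1 - xk) * math.log(1 - zk + eps)
                for xk, zk in zip(x, z))

# Toy example: K = 3 inputs, H = 2 hidden units, all weights set to zero.
P_e = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_e, b_d = [0.0, 0.0], [0.0, 0.0, 0.0]
x = [1.0, 0.0, 1.0]
z = decode_tied(P_e, b_d, encode(P_e, b_e, x))
print(reconstruction_loss(x, z))
```

With tied weights only `P_e`, `b_e` and `b_d` are stored, which makes the saving in parameters mentioned in the text explicit.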

In this scenario $h$ is similar to a lossy compression of $x$, which should
capture the coordinates along the main directions of variation of $x$. To further
improve the network, the input $x$ can be ‘perturbed’ into another slightly
different image, $\tilde{x}$, so that the network will not adapt too much to the
given inputs but will generalize better to new samples. This forms
the Denoising variant of the Autoencoder. To do this, the input is corrupted
by randomly setting some of the values to zero (Bengio, 2009). This way
the Denoising Autoencoder will try to reconstruct the image including the
missing parts. Another benefit of the stochastic corruption is that, when
using a hidden layer bigger than the input layer, the network does not learn
the identity function (which is the simplest mapping between the input and
the output) but instead learns a more useful mapping, since it is also trying
to reconstruct the missing part of the image.

Figure 6.3: The process of polarity prediction of a tweet with its associated image
performed by our model. On the left, one tweet text window (in red) at a time is fed into
the CBOW model to get a textual representation. Likewise, the associated image is fed
into the denoising autoencoder (DA). The two representations are concatenated and a
polarity score for the window is obtained from the logistic regression (LR). Finally, each
window polarity is summed into a final tweet polarity score.
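
The corruption step can be sketched as masking noise applied to the input vector. The parameter name `drop_prob` is hypothetical, and the text does not state which corruption level was used.

```python
import random

def corrupt(x, drop_prob, rng):
    """Masking noise: each component of x is independently set to zero
    with probability drop_prob (a hypothetical parameter name)."""
    return [0.0 if rng.random() < drop_prob else v for v in x]

rng = random.Random(42)
x = [0.2, 0.9, 0.5, 0.7, 0.1]
x_tilde = corrupt(x, drop_prob=0.3, rng=rng)
print(x_tilde)
```

In the standard denoising setting of (Vincent et al., 2008), the corrupted vector is fed to the encoder while the reconstruction loss is still measured against the clean input, so the network is pushed to fill in the zeroed components.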


The CBOW-DA-LR method. The model used to deal with textual and 
visual information, denoted as CBOW-DA-LR, is an extension of CBOW- 
LR with the addition of a new task based on a Denoising Autoencoder (DA) 
applied to images, aiming at obtaining a mid-level representation. In this fi- 
nal form, the descriptor obtained from the DA, together with the continuous 
word representation, represents the new descriptor for a window of words in 
a tweet and is concurrently used to learn a logistic regressor. Given a tweet, 
for each window, we compute the continuous word representation and the 
image descriptor associated with the tweet. Each window in a tweet will 
be associated with the same image descriptor as the image for the tweet is 
always the same. 

Fig. 6.3 illustrates the prediction process for a tweet
with its accompanying image. While the image gets a fixed representation
for the entire process, the text is represented one window at a time through
a sliding window process. Each window is processed independently to get a
local polarity score. To get the overall tweet polarity, the window polarities
are summed into a final score, which is classified according to its sign.
This can be formalized as follows: if we define $h_G$ as the encoding of the
image associated with the window $G$ of the tweet $T$, then Eq. (6.5) becomes:

$$ y(G) = \sigma\Big(C^T \Big(\Big(\sum_{W_i \leftarrow w_i \in G} W_i\Big) \,\|\, h_G\Big) + b_s\Big) \tag{6.14} $$

where $\|$ is the concatenation operator, i.e. the encoded representation
of the image is concatenated to the continuous word representation of the
window, forming a new vector whose size is the sum of the size of the
continuous word space and the size of the encoding representation of the
image.
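
A minimal sketch of the multimodal window score of Eq. (6.14) follows. It assumes that the weight vector of the logistic regression has one entry per concatenated feature, i.e. the word vector size plus the size of the image code; the function name is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multimodal_window_polarity(W, C, b_s, window, h_G):
    """Eq. (6.14): sum the window's word vectors, append the image code h_G,
    then apply the logistic regression. W is a list of word vectors and
    C must have length Q + len(h_G)."""
    Q = len(W[0])
    text_part = [sum(W[i][q] for i in window) for q in range(Q)]
    features = text_part + list(h_G)          # the concatenation operator
    if len(features) != len(C):
        raise ValueError("C must have one weight per concatenated feature")
    return sigmoid(sum(c * f for c, f in zip(C, features)) + b_s)

# Toy values: Q = 2 word dimensions and a 1-dimensional image code.
W = [[1.0, 0.0], [0.0, 1.0]]
C, b_s = [1.0, -1.0, 2.0], 0.0
print(multimodal_window_polarity(W, C, b_s, window=[0, 1], h_G=[0.5]))
```

Since the image code is the same for every window of a tweet, it can be computed once per tweet and reused across windows.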

As stated before, the Autoencoder can be pre-trained in the same fashion 
as the continuous word representation. Any set of unlabeled images can be 
used to train the network before the actual training on the tweets. 

The DA will be a component of our model and, like the two previous
components CBOW and LR, it has its own cost function. Similar to Eq.
(6.13), it is:

$$ C_{\text{img}} = -\sum_{k=1}^{K} \big( x_k \log z_k + (1 - x_k) \log(1 - z_k) \big) \tag{6.15} $$

Since we aim at concurrently learning the textual and image
representations, the three components are combined in a single final
cost for CBOW-DA-LR. Starting from the previously defined Eq. (6.7) for
CBOW and Eq. (6.6) for LR, the cost becomes:

$$ C_{\text{CBOW-DA-LR}} = \lambda_1 \cdot C_{\text{sent}} + \lambda_2 \cdot C_{\text{sem}} + \lambda_3 \cdot C_{\text{img}} \tag{6.16} $$

where $\lambda_1, \lambda_2, \lambda_3$ weight the contribution of each task. The
model can be trained by minimizing $C_{\text{CBOW-DA-LR}}$ with stochastic gradient
descent. Symbolic derivatives can be easily obtained by using an automatic
differentiation tool (e.g. Theano (Bastien et al., 2012)). After training, Eq.
(6.9) can be used to predict the label of the tweet in the same manner as
when the image descriptor is not considered.


6.4 Experiments 


The datasets. To evaluate the proposed approach we have used four datasets
obtained from Twitter:

i) Sanders Corpus⁴ consists of 5,513 manually labelled tweets on 4 topics
(Apple, Google, Microsoft and Twitter). Of these, after removing missing
tweets, retweets and duplicates, only 3,625 remain. The dataset does not
specify a train and a test subset, so to evaluate the performance the whole
set is randomly divided multiple times into subsets of the same size,
and the mean performance is reported;

⁴ http://sananalytics.com/lab/twitter-sentiment/

ii) Sentiment140⁵ (Go et al., 2009) consists of a training set of 1.6 million
tweets, collected and weakly annotated by querying positive and negative
emoticons: a tweet is considered positive if it contains a positive emoticon like
“:)” and negative if, likewise, it contains a negative emoticon like “:(”; the
dataset also comprises a manually annotated test set of 498 tweets obtained
by querying names of products, companies and people;

iii) SemEval-2013⁶ provides a training set of 9,684 tweets, of which only
8,208 were still available at the time of writing, and a test set of 3,813 tweets,
selected by querying a mixture of entities, products and events; the dataset is
part of the SemEval-2013 challenge for sentiment analysis and also comprises
a development set of 1,654 tweets (of which only 1,413 were available at the
time of writing) that can be used as an addendum to the training set or as a
validation set;

iv) SentiBank Twitter Dataset⁷ consists of 470 positive and 133 negative
tweets with images, related to 21 topics, annotated using Mechanical Turk;
the dataset has been partitioned by the authors into 5 subsets, each of
around 120 tweets with the respective images, to be used for a 5-fold
cross-validation.


⁵ http://help.sentiment140.com/for-students
⁶ http://www.cs.york.ac.uk/semeval-2013/task2/
⁷ http://www.ee.columbia.edu/ln/dvmm/vso/download/sentibank.html

In this work we consider the binary positive/negative classification, thus
we have removed neutral/objective tweets from the corpora when necessary.
This approach follows that of (Go et al., 2009) and (Liu et al., 2012), and is
motivated by the difficulty of obtaining training data for this class; it has to be
noted that even human annotators tend to disagree on whether a tweet has a
negative/positive polarity or is neutral (Jiang et al., 2011). Performance
is reported in terms of Accuracy, except for SemEval, which is evaluated
using $F_1$, since this is the metric originally used for this dataset.

For the Sanders dataset, as described earlier, there is no definition of an
actual test set nor of a training set. For this reason we choose to follow
the experimental setup of (Liu et al., 2012), where experiments on the Sanders
dataset have been performed varying the number of training tweets from
32 to 768. For each test, first the number of training tweets is selected, then
half of them are randomly chosen from all the positive tweets and the other
half are chosen from the negative ones. Finally, the remaining tweets are
used as test set. Since there could be some variation from one random set to
another, for each test 10 different runs are evaluated and the mean is taken
as the result of the selected test. Results with this dataset are reported
with the notation “Sanders@n”, where n is the number of training tweets
selected.
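
The Sanders@n protocol described above can be sketched as follows. The label strings, the function names and the `evaluate` callable (which is expected to train on the given indices and return a score on the test indices) are hypothetical placeholders.

```python
import random

def sanders_split(labels, n, rng):
    """Pick n training tweets, half positive and half negative, at random;
    all remaining tweets form the test set. Returns index lists."""
    pos = [i for i, y in enumerate(labels) if y == "pos"]
    neg = [i for i, y in enumerate(labels) if y == "neg"]
    train = set(rng.sample(pos, n // 2) + rng.sample(neg, n // 2))
    test = [i for i in range(len(labels)) if i not in train]
    return sorted(train), test

def mean_over_runs(labels, n, evaluate, runs=10, seed=0):
    """Repeat the random split `runs` times and average the score returned
    by evaluate(train_idx, test_idx), a hypothetical user-supplied callable."""
    rng = random.Random(seed)
    results = [evaluate(*sanders_split(labels, n, rng)) for _ in range(runs)]
    return sum(results) / len(results)

# Made-up label distribution, only to exercise the split.
labels = ["pos"] * 40 + ["neg"] * 60
train_idx, test_idx = sanders_split(labels, 32, random.Random(1))
print(len(train_idx), len(test_idx))
```

Note that the training set is balanced by construction while the test set keeps whatever class proportions remain.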

The evaluation on the SentiBank dataset has been performed preserving
the structure given by the authors, so that the results are comparable.
The dataset is divided into 5 subsets for 5-fold cross-validation. One at a
time, each subset is used as test set while the other 4 are used as
training set; 5 runs are performed, and the mean of the 5 results
is taken as the performance of the method on
the dataset. Considering the high imbalance between positive and negative
tweets in this dataset, we also report the $F_1$ score in addition to Accuracy.
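
The 5-fold protocol and the two metrics can be sketched as below. Which class the $F_1$ score refers to is not stated in the text, so the sketch takes it as a parameter; `train_and_predict` is a hypothetical callable standing in for any of the compared methods.

```python
def accuracy(y_true, y_pred):
    """Fraction of tweets whose predicted label matches the annotation."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, target):
    """F1 of the class `target` (the choice of class is an assumption)."""
    tp = sum(t == target and p == target for t, p in zip(y_true, y_pred))
    fp = sum(t != target and p == target for t, p in zip(y_true, y_pred))
    fn = sum(t == target and p != target for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cross_validate(folds, train_and_predict, target):
    """folds: list of (samples, labels) pairs; each fold is the test set once.
    train_and_predict(train_x, train_y, test_x) is a hypothetical callable
    returning one predicted label per test sample.
    Returns (mean accuracy, mean F1) over the folds."""
    accs, f1s = [], []
    for k, (test_x, test_y) in enumerate(folds):
        train_x = [x for j, (xs, _) in enumerate(folds) if j != k for x in xs]
        train_y = [y for j, (_, ys) in enumerate(folds) if j != k for y in ys]
        pred = train_and_predict(train_x, train_y, test_x)
        accs.append(accuracy(test_y, pred))
        f1s.append(f1_score(test_y, pred, target))
    return sum(accs) / len(accs), sum(f1s) / len(f1s)

# On imbalanced data a constant predictor looks good on accuracy only.
y_true = ["pos"] * 4 + ["neg"]
y_pred = ["pos"] * 5
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred, "neg"))
```

The last two lines show why a second metric is useful on an imbalanced dataset: always answering "pos" reaches a high accuracy while the score on the minority class is zero.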

We have evaluated the proposed method through a set of 5 experiments:
in the first one we evaluate the performance of the proposed CBOW-LR text
model, comparing it against the standard CBOW model. Then we assess the
performance of these models after pre-training them with large scale Twitter
corpora. In a third experiment we compare the proposed approach against
a baseline and two state-of-the-art methods. In the fourth experiment we
compare the proposed CBOW-DA-LR text+image model against a
state-of-the-art method on a publicly available dataset composed of tweets with
images. In all these experiments we empirically fixed K = 5 and Q = 100. In
the last experiment we evaluate the effects of the K and Q parameters on the
classification performance on all the datasets. Regarding $\lambda$ in the first three
experiments and $\lambda_1, \lambda_2, \lambda_3$ in the fourth one, we tested several
combinations and found a good setting by fixing $\lambda = 0.5$ and
$\lambda_1 = \lambda_2 = \lambda_3 = 0.33$, respectively. The image DA was
implemented with ‘tied weights’ to reduce overfitting. Its dimensionality was
tested in the range [200, 1000], and the best performance was obtained by
fixing it to 500. To perform the optimization using stochastic gradient descent,
we employed Theano (Bastien et al., 2012) to automatically compute the
derivatives.


Exp. 1: Comparison with baselines. Tab. 6.1 compares our proposed
method (CBOW-LR) with two baselines: RAND-LR and CBOW+SVM.
The purpose is twofold: i) since we are learning features crafted for the
specific task, we compare our method with randomly generated features.
RAND-LR learns a logistic regression classifier on random word features
(i.e. we set $\lambda = 1$ in Eq. (6.8)); ii) we verify the superiority of the features
learned by CBOW-LR over a standard unsupervised CBOW representation.
The CBOW+SVM baseline employs an SVM with a standard CBOW
representation pre-trained on the specific dataset.

Performance figures show that the proposed method consistently
outperforms both baselines, i.e. our method learns useful representations with
some improvement over CBOW.

Dataset              CBOW-LR (proposed)   RAND-LR   CBOW+SVM
Sentiment140         83.01                61.56     79.39
SemEval-2013 (F1)    72.57                53.01     71.32
Sanders@32           62.55                58.38     59.89
Sanders@256          74.91                63.69     67.91
Sanders@768          82.69                65.53     73.03

Table 6.1: Comparison between our method and two baselines. Performance is reported
in terms of accuracy except for SemEval-2013, where the F1 measure is used. Sanders@n
indicates the number of training tweets used for the experiments on that dataset.


Exp. 2: Exploiting CBOW training on large scale data. Tab. 6.2 compares
our proposed method with two baselines when exploiting large scale
training data for the CBOW representation. We pre-trained a CBOW model
using the 1.6 million tweets of Sentiment140 and used the learned features
(termed CBOWs) with two standard learning algorithms: CBOWs+LR
employs logistic regression, while CBOWs+SVM uses an SVM classifier.
In contrast to the baselines, our model CBOWs-LR employs the pre-trained
CBOWs features as initialization for the W matrix. Comparing Tab. 6.2
with Tab. 6.1 shows that the CBOWs+SVM baseline benefits from the use of
pre-learned CBOWs features. This is visible especially on the Sanders dataset,
as a richer representation is built. Note that when CBOWs+SVM is applied
to the Sentiment140 dataset it corresponds to CBOW+SVM, since the CBOWs
description is trained on Sentiment140; therefore the result is the same.
While both CBOWs+SVM and CBOWs+LR are unable to modify the
word vector representation, our model CBOWs-LR is able to retain the full
richness of the initial representation and improve it on two datasets.


Dataset              CBOWs-LR (proposed)   CBOWs+LR   CBOWs+SVM
Sentiment140         83.84                 76.32      79.39
SemEval-2013 (F1)    72.23                 73.73      71.48
Sanders@32           66.28                 66.90      66.65
Sanders@256          76.33                 71.14      73.69
Sanders@768          82.98                 75.43      76.44

Table 6.2: Comparison between our method and two baselines, using an initialization
based on CBOW pre-trained separately with the 1.6 million tweets of Sentiment140.
Performance is reported in terms of accuracy except for SemEval-2013, where the F1
measure is used. Sanders@n indicates the number of training tweets used for the
experiments on that dataset.


Exp. 3: Comparison with FSLM and ESLAM. In this experiment we
have compared both textual variants of our approach, one with CBOW
trained using the dataset on which the method is applied and one using
CBOWs, with two state-of-the-art methods: FSLM and ESLAM, proposed
in (Liu et al., 2012). FSLM uses a fully supervised probabilistic language
model, learned by concatenating all the tweets of the same class to form
synthetic documents. ESLAM extends FSLM by exploiting noisy tweets, selected
on the basis of the presence of ‘positive’ and ‘negative’ emoticons, to smooth
the language model. Combining manually labelled data with the unsupervised
noisy data gives the power to deal with unforeseen text that is not
easily handled by fully supervised methods. Fig. 6.4 shows the Accuracy while
varying the number of training tweets of the Sanders dataset. The proposed
approach has a much lower performance when using only 32 or 64 tweets
for training. However, it can be observed that as the number of training
data increases so does the performance of the proposed method, which
outperforms ESLAM when using 768 tweets for training. In general the
proposed method outperforms FSLM. The fact that ESLAM outperforms
the proposed method when using smaller training data can be explained by
the fact that CBOW models, like Skip-Gram and other feature learning methods,
require large training datasets.

Figure 6.4: Comparison between our method with FSLM and ESLAM (Liu et al., 2012)
on Sanders dataset, while varying the number of training tweets.

Data         Method                                                                   SentiBank (Acc)   SentiBank (F1)
-            Random                                                                   47                42
Text         SentiStrength (Thelwall et al., 2010)                                    58                51
Text         CBOW+SVM                                                                 72                50
Text         CBOW-LR (proposed)                                                       75                52
Image        SentiBank (Borth et al., 2013)                                           71                51
Image        DA-LR (proposed)                                                         69                51
Text+Image   SentiStrength (Thelwall et al., 2010) + SentiBank (Borth et al., 2013)   72                n.a.
Text+Image   CBOW-DA-LR (proposed)                                                    79                57

Table 6.3: Comparison between our method (on single and combined modalities) with
baselines and state-of-the-art approaches on SentiBank Twitter Dataset.


Exp. 4: Exploiting textual and visual data. In this experiment we have
compared three versions of our proposed approach (CBOW-LR for text,
DA-LR for visual data, and CBOW-DA-LR for both text and visual
information) against different baselines and state-of-the-art
approaches.

CBOW-LR has been compared with SentiStrength (Thelwall et al., 2010)
and the CBOW+SVM baseline used in Exp. 1 and Exp. 2. DA-LR has been
compared with SentiBank (Borth et al., 2013) classifiers. CBOW-DA-LR
has been compared with the approach proposed by the authors of the
SentiBank Twitter dataset (Borth et al., 2013), which uses the SentiStrength
(Thelwall et al., 2010) API⁸ for text classification and SentiBank classifiers as
mid-level visual features, with a logistic regression model. As the dataset is
imbalanced, we also compare these approaches with an additional baseline
based on random classification, i.e. we assign a random polarity to each
test tweet. We used the code provided by the authors of the methods,
except for the SentiStrength+SentiBank case, for which we report the
result published in (Borth et al., 2013). Results reported in Tab. 6.3 show
that CBOW-LR outperforms not only the baseline and SentiStrength,
but also the multimodal SentiStrength+SentiBank approach. When
using only visual information, SentiBank obtains a better performance than
DA-LR. Considering the text+image case, it can be observed that the
proposed multimodal CBOW-DA-LR method improves upon single modalities
(CBOW-LR and DA-LR) and outperforms SentiStrength+SentiBank by a
larger margin, indicating that images hold meaningful information regarding
the polarity of text, and thus can be exploited to improve overall Accuracy
and $F_1$.

⁸ http://sentistrength.wlv.ac.uk/


Exp. 5: Parameters analysis. Fig. 6.5 shows accuracy and $F_1$ of our model
when varying the K and Q parameters on the Sanders, SemEval-2013 and
Sentiment140 datasets. The performance on SentiBank is practically not affected
by these parameters. The same set of parameters results in the best
performance on all the datasets. The values of K and Q are in line with those
used by Mikolov et al. to train CBOW models on Wikipedia.

Figure 6.5: Performance of the proposed method when varying K and Q parameters on
Sanders, SemEval-2013 and Sentiment140 datasets.


6.5 Conclusions 


In this chapter we have presented a method for sentiment analysis of social
network multimedia, based on a unified model that considers both textual
and visual information.

Regarding textual analysis, we described a novel semi-supervised model,
CBOW-LR, which extends the CBOW model and concurrently learns a vector
representation and a sentiment polarity classifier on short texts such as
tweets. Our experiments show that CBOW-LR can obtain improved
accuracy on polarity classification over the CBOW representation on the same
quantity of text. When considering a large unsupervised corpus of tweets as
additional training data for CBOW, a further improvement is shown, with
our model being able to improve the overall accuracy. Comparison with the
state-of-the-art methods FSLM and ESLAM shows promising results.

The CBOW-LR model has been extended to account for visual information
using a Denoising Autoencoder. The unified model (CBOW-DA-LR)
works in an unsupervised and semi-supervised manner, learning text and
image representations, as well as the sentiment polarity classifier for tweets
containing images. The unified CBOW-DA-LR model has been compared
with SentiBank, a state-of-the-art approach, on a publicly available Twitter
dataset, obtaining a higher classification accuracy.


Chapter 7
Popularity Prediction with Sentiment and Context Features


Images in social networks share different destinies: some are go- 
ing to become popular while others are going to be completely 
unnoticed. In this chapter we propose to use visual sentiment 
features together with three novel context features to predict a 
concise popularity score of social images. Experiments on large 
scale datasets show the benefits of proposed features on the perfor- 
mance of image popularity prediction. Exploiting state-of-the-art 
sentiment features, we report a qualitative analysis of which sen- 
timents seem to be related to good or poor popularity. 


7.1 Introduction 


In the last decade users of social networks such as Flickr and Facebook have
uploaded tens of billions of photos, often adding accompanying metadata by
tagging and by providing a short description. Users interact with each other
by forming groups of shared interests, following the status streams of each
other, and by commenting on the photos that have been shared. Inevitably,
in the huge quantity of available media, some of these images are going
to become very popular, while others are going to be totally unnoticed
and end up in oblivion. Often, media may be popular because it conveys
sentiments or has a rich meaning in the social context in which it is put. In
fact, sentiments have been known to affect the popularity of visual media since
television programs became widely watched (Diener and DeFour, 1978).
Sentiment was also recently found to be related to popularity in tweets (Bae and
Lee, 2012). Being able to predict the popularity of a media item may have a
profound impact on several essential applications such as content retrieval
and annotation, but also in other fields such as advertising and content
distribution (Figueiredo et al., 2014).

¹ Parts of the work presented in this chapter have been published in Gelli, F., Uricchio, T.,
Bertini, M., Del Bimbo, A., and Chang, S. F. (2015, October). “Image popularity prediction in
social media using sentiment and context features”. In Proceedings of the 23rd ACM international
conference on Multimedia (pp. 907-910). ACM. The publication is available at
http://dx.doi.org/10.1145/2733373.2806361.

Figure 7.1: The idea of our approach to popularity prediction of images.

In this chapter, we address the problem of predicting the popularity of
an image posted in a social network, considering different usage
scenarios. Despite the recent crop of literature that
studies the question of what makes an image popular (Khosla et al., 2014;
Totti et al., 2014; McParlane et al., 2014), none of these works addresses the
question of how much visual sentiment influences the popularity of
media. As social context has been widely found important to predict media
popularity (Khosla et al., 2014), we show how to further improve popularity
estimation by using a knowledge base to supplement the understanding of
semantics in textual metadata.

The main contributions of this chapter are:


• we propose to employ state-of-the-art visual sentiment features (Borth
  et al., 2013; Chen, Borth, Darrell and Chang, 2014) to perform image
  popularity prediction;


• we propose three new textual features based on a knowledge base, to
  better model the semantic description of an image, in addition to the
  social context features proposed in (Khosla et al., 2014; McParlane
  et al., 2014);


• we show qualitative results of which sentiments seem to be related to
  a good or poor popularity.


To the best of our knowledge, this is the first work to identify specific
visual concepts that positively or negatively influence the eventual
popularity of images, beyond just numerical prediction of photo popularity.
Experiments performed on large scale datasets illustrate several benefits
of the two types of proposed features, and show how their combination
effectively improves the performance of popularity prediction.


7.2 Related work 


Popularity Prediction Recently, a significant effort has been spent on in- 
vestigating popularity of social media content. Regarding image popularity, 
the majority of works agree that social features have the greatest predic- 
tive power (Khosla et al., 2014; McParlane et al., 2014; Totti et al., 2014). 
Visual content features are less powerful than social ones in terms of pre- 
dictive power, but they are useful when no user metadata is present (e.g. no 
tags or description) or to address scenarios such as the case in which no so- 
cial interactions have been recorded before posting the image (e.g. because 
the user has just joined the social network). Previous works vary in terms
of popularity score definition (e.g. image views, reshares, mean views over
a period) but they all share the same basic pipeline: they extract several
content and context related features and then employ a regressor to
compute the popularity score.

In (Khosla et al., 2014), Khosla et al. investigate both low-level features 
such as color, GIST, LBP, and content features such as the object pre- 
dictions and network activations of a state-of-the-art CNN image classifier 
(Krizhevsky et al., 2012). Together with user and image context features, 
they show promising results. McParlane et al. (McParlane et al., 2014) 
propose to use image content, context features and user context to predict 
popularity. Their analysis is limited to a cold start scenario, i.e. where there 
exist no or little textual or interaction data. Totti et al. (Totti et al., 2014) 
investigate the use of aesthetics features such as blur, aspect ratio and color 
channel statistics together with the output of 85 object classifiers as content 
features. 


Visual Sentiment A few works have addressed the problem of multime- 
dia sentiment analysis of social network images. Starting from the 24 ba- 
sic emotions of Plutchik's Wheel of Emotions (Plutchik, 2001), Borth et 
al. (Borth et al., 2013) have recently presented a large-scale visual senti- 
ment ontology termed SentiBank. They train 3,244 detectors on pairs of 
nouns and adjectives (ANPs) based on a combination of global and local 
features. Based on the recent breakthrough of convolutional networks for 
classification (Krizhevsky et al., 2012), Chen et al. (Chen, Borth, Darrell 
and Chang, 2014) used a CNN to replace SVM in the approach of Borth et 
al. (Borth et al., 2013), obtaining an improved accuracy on ANPs. 

The authors in (Chen, Yu, Chen, Cui, Chen and Chang, 2014) proposed
a hierarchical system able to handle sentiment concept classification and
localization on objects. They found the individual concept detectors of
SentiBank (Borth et al., 2013) to be less reliable for object-based concepts.

Chen et al. (Chen, Chen, Hsu, Liao and Chang, 2014) studied the
correlation between the sentiment intended by the publisher and the
sentiment actually induced in the viewer (‘viewer affect concept’). They
aim to recommend appropriate images to the publisher by predicting in
advance the sentiment induced in the viewer.


7.3 The Proposed Method 


Our proposed method is based on two hypotheses: i) the popularity of an
image can be fueled by the inherent visual sentiments it conveys; ii) the
semantic descriptions of an image are also important for its popularity,
since they make the image easier to find or look at.


7.3.1 Measuring Popularity 


It is difficult to precisely define a single score as a measure of
popularity, and several ways have been proposed to measure it. Khosla et
al. (Khosla et al., 2014) used the number of views on Flickr as the
principal metric. McParlane et al. (McParlane et al., 2014) consider both
the number of views and the number of comments for each image, as they
have been found to be correlated in video popularity (Chatzopoulou et al.,
2010). However, they only aim to predict two classes of popularity: high
or low.

In this work we follow Khosla et al. (Khosla et al., 2014) and consider
the number of views on Flickr as the popularity metric. To cope with the
large variation of views, we divide the number of views by the time
elapsed between the user upload and our retrieval, then we apply the log
function.
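
The normalization just described can be sketched in a few lines. This is
only an illustration, not the code used in the experiments: the function
name, the choice of days as time unit, the natural logarithm and the
optional `offset` (which avoids log 0 for images with no views) are all
assumptions.

```python
import math
from datetime import datetime


def popularity_score(views, uploaded, retrieved, offset=0.0):
    """Log of the number of views per day between upload and retrieval.

    Assumptions for illustration: time is measured in days, the log is
    natural, and `offset` is an optional constant added before the log.
    """
    days = (retrieved - uploaded).total_seconds() / 86400.0
    if days <= 0:
        raise ValueError("retrieval must happen after upload")
    rate = views / days + offset
    if rate <= 0:
        raise ValueError("images with zero views need a positive offset")
    return math.log(rate)


# 1,000 views collected over 100 days give log(10 views per day).
score = popularity_score(1000, datetime(2014, 1, 1), datetime(2014, 4, 11))
```

Dividing by the elapsed time makes old and recent uploads comparable, while
the logarithm compresses the heavy tail of the view counts.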


7.3.2 Visual Sentiment Features 


To discover which visual emotions are aroused by viewing an
image, a visual sentiment concept classification is performed based on the
Visual Sentiment Ontology (VSO). The ontology, consisting of a collection
of 3,244 Adjective-Noun-Pairs (ANPs), has been defined by Borth et
al. (Borth et al., 2013). In particular we used DeepSentiBank (Chen, Borth,
Darrell and Chang, 2014): a convolutional neural network pre-trained from
(Krizhevsky et al., 2012) has been fine-tuned to classify images in one of a
subset of 2,096 ANPs. Similarly to its previous version (Borth et al., 2013),
this tool provides a mid-level representation of an image.

For each image we extract two descriptors that we term respectively
SentANPs and FeatANPs: the 2,096d ANP prediction layer and the 4,096d
rectified activations of the 7th fully connected layer.


7.3.3 Object Features 


Since image popularity is also related to the visual content of the image,
we extract convolutional neural network features, as initially proposed in
(Khosla et al., 2014). A very deep CNN with 16 layers (Chatfield et al.,
2014) was used to extract for each image the final output containing the
1,000 objects of the ILSVRC 2014 challenge (termed ObjOut) and the 4,096d
representation of the 7th rectified fully connected layer (termed ObjFC7).


7.3.4 Context Features 


Image context information such as tags and description contains important
cues that may be reflected in the number of views that an image obtains.
Entities like people, locations or tourist attractions can affect
popularity as i) people may be more interested in photographs referring to
some particular subject; ii) the presence of tags and description, the
submission of a photo to some groups, etc. make it easier to be found by
other users. The extraction of entities from image context strongly
depends on the nature of the text, i.e. tags and textual description; due
to the different nature of these channels, two different approaches are
proposed.


Entity Extraction from Tags Starting from image tags, we define two new
context features that we term TagType and TagDomain. They both rely
on Freebase, a large collaborative ontology containing millions of
interconnected topics. Given a tag, a search for a Freebase topic is
performed: if the tag is related to some topics, the most popular one is
picked, according to the Freebase popularity ranking. Meaningless tags
that do not match any Freebase topic are ignored, so they do not act as a
nuisance. When a Freebase topic is retrieved, another query is performed
to extract its Freebase types with the “notable” property and its Freebase
domain. While types are mostly specific (e.g. Person, Author), domains
cover broader areas (e.g. Film, Music).

Due to the vast number of types in the ontology, a smaller specific type
knowledge base is introduced. We first randomly sampled 10k tags from
the vocabulary of the MIR-Flickr dataset (Huiskes et al., 2010) and used
them to extract Freebase types. We select the 100 most frequent types as
our specific knowledge base.

The extraction of the TagType feature for an image is then straightforward:
each tag is used to query Freebase for a notable type. We count the matches
to the 100 selected types and obtain a 100d histogram as the final feature.
Regarding the TagDomain feature, we take the full list of 78 domains
pre-defined by Freebase curators and count the tag matches, as done for
TagType. Thus, the final TagDomain feature results in a 78d histogram.
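
The counting step behind both histograms can be sketched as follows. The
knowledge-base query is replaced by a plain dictionary, `tag_to_type`,
which is a hypothetical stand-in for the Freebase lookup; the tags, the
three-entry vocabulary and the function name are likewise invented for
illustration (the actual vocabularies have 100 types and 78 domains).

```python
from collections import Counter


def tag_histogram(tags, lookup, vocabulary):
    """Count how many tags map to each entry of `vocabulary`.

    `lookup` is a hypothetical stand-in for the knowledge-base query: it
    maps a tag to a type (or domain) name. Tags missing from `lookup`, or
    mapped to a name outside `vocabulary`, are simply ignored.
    """
    counts = Counter(lookup[t] for t in tags if t in lookup)
    return [counts.get(name, 0) for name in vocabulary]


# Toy data, invented for illustration only.
types = ["person", "location", "film character"]
tag_to_type = {"obama": "person", "paris": "location", "rome": "location"}

hist = tag_histogram(["paris", "rome", "obama", "xyz123"], tag_to_type, types)
# hist == [1, 2, 0]; the unmatched tag "xyz123" contributes nothing.
```

The same function yields TagType or TagDomain depending on whether
`lookup` and `vocabulary` refer to types or to domains.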


Entity Extraction from description Differently from the concise tags,
image descriptions allow users to comprehensively detail their images in
natural language. We seek to recognize the subjects and objects of this
text to detail the context. Hence, we adopt a well known CRF-based language
model to perform Named Entity Recognition (NER) (Finkel et al., 2005). We
used the pre-trained 7-class model for MUC that is able to recognize Time,
Location, Organization, Person, Money, Percent, Date. We count the
occurrences of each class and build a 7d feature that we term NER.


7.3.5 User Features 


Previous works have found that the number of views that a photograph
is going to obtain depends not only on the image itself and its context
information, but also on the author data. In this work we used the same
user features proposed by Khosla et al. (Khosla et al., 2014): among these
features the one most related to popularity is the mean number of views of
the user's images, as it represents the popularity of the user.


7.3.6 Popularity prediction 


In order to predict popularity as a concise score, we used an off-the-shelf
Support Vector Machine. As we are working with large-scale datasets, we
used the L2-regularized L2-loss Support Vector Regression (SVR) from the
LIBLINEAR package, which scales better than a kernelized version to large
sparse data and a huge number of instances.
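
For reference, the primal problem solved by an L2-regularized L2-loss
linear SVR, as documented for LIBLINEAR, can be written as below. The
symbols are our notation rather than the chapter's: x_i is the
concatenated feature vector of image i, y_i its popularity score, w the
weight vector, C the cost parameter and ε the width of the insensitive
zone (its value in the experiments is not stated).

```latex
\min_{w} \; \frac{1}{2}\, w^{\top} w
  \; + \; C \sum_{i=1}^{n} \max\bigl(0,\; |y_i - w^{\top} x_i| - \epsilon \bigr)^{2}
```

Because the model is linear, each component of w can be read as the
contribution of one feature dimension to the predicted score.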


7.4 Experiments 


As different scenarios show different aspects of popularity, we structure
our experimental setups similarly to those of Khosla et al. (Khosla et al.,
2014), using the Flickr social network. Two datasets were used to represent
two different scenarios:


• One-Per-User (OPU): we randomly selected 250k images from the
  VSO Flickr Dataset (Borth et al., 2013). This dataset represents the
  scenario of a Flickr search, where images belong to different users.


• User Specific (US): 25 users from the VSO Flickr Dataset are selected
  at random to constitute 25 different trials. For each one, 10k images
  are randomly selected. This dataset represents the scenario of a user
  who wants to select which of their pictures should be uploaded to
  attract the attention of other users.


In each experiment, we extract and concatenate the selected features.
We freely provide the extracted features on our website. Multidimensional
features are L2 normalized, while scalar attributes are scaled to the
[0, 1] range. We split every dataset into training and evaluation sets:
half of the images were randomly chosen as the training set, while the
remaining images were equally split into validation and testing sets. The
C parameter of the SVR was set in the range [0.001, 100].
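
The preprocessing and split described above can be sketched as follows.
This is a minimal illustration under stated assumptions: the function
names, the fixed random seed and the log-spaced grid for C are
hypothetical, since the text only gives the normalization rules, the
50/25/25 proportions and the range of C.

```python
import math
import random


def l2_normalize(vec):
    """Scale a multidimensional feature to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def min_max_scale(values):
    """Map one scalar attribute, over all images, to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_indices(n, seed=0):
    """Random 50% / 25% / 25% split into train, validation and test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    half = n // 2
    cut = half + (n - half) // 2
    return idx[:half], idx[half:cut], idx[cut:]


# Hypothetical log-spaced grid for C covering [0.001, 100].
c_grid = [10.0 ** k for k in range(-3, 3)]

train_idx, val_idx, test_idx = split_indices(1000)
```

In such a setup the validation part would typically be used to pick C and
the testing part to report the final correlation.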

After the prediction, testing images are ranked in descending popularity
scores and compared to the correct ranking obtained by the ground truth
scores. The correlation between these two lists r and s is computed using
Spearman’s rank correlation that ranges in [-1, 1]:

\[
\rho = \frac{\sum_i (r_i - \bar{r})(s_i - \bar{s})}
            {\sqrt{\sum_i (r_i - \bar{r})^2}\,\sqrt{\sum_i (s_i - \bar{s})^2}}
\tag{7.1}
\]

where r_i and s_i are the ranks of image i in the two lists and \bar{r},
\bar{s} are their means. A score of 1 (or -1) corresponds to perfect
(inverse) correlation, while 0 corresponds to random ranks.


7.4.1 Results 


Experiments have been carried out for visual features, context features
and the visual + context + user combination. We train a model with each
single feature to show its predictive power. Then, we combine the features
and compare a model using all of them against baselines implemented
following the method of Khosla et al. (Khosla et al., 2014), i.e. without
our novel features. Results are reported in terms of Spearman’s rank
correlation and, for the User Specific dataset, the scores are averaged
over the 25 users.


Visual Features Visual content features include visual sentiment and ob- 
ject detections (Sec. 7.3.2, 7.3.3). The latter ones are used in this case as a 
baseline, including ObjOut and ObjFC7. 


Dataset | SentANPs | FeatANPs | ObjOut | ObjFC7 || Baseline | All
OPU     |   0.28   |   0.32   |  0.13  |  0.30  ||   0.30   | 0.36
US      |   0.31   |   0.40   |  0.27  |  0.40  ||   0.40   | 0.43


Table 7.1: Visual Features Results 


Results are reported in Table 7.1: sentiment features are comparable
with object features. As ANPs are learned starting from a similar network
for classification, this suggests the existence of some correlation between
them. Nevertheless, SentANPs scores higher than ObjOut, suggesting that
ANPs are better suited to popularity prediction than pure object
classification. Our features improve the overall prediction in both
scenarios.


Context Features The performance of the proposed context features (Sec. 7.3.4)
is compared with a baseline composed of the number of tags and the lengths
of the title and description (Table 7.2).


Dataset | TagType | TagDomain | NER  | TagNum | TitleLen | DescLen || Baseline | All
OPU     |  0.42   |   0.36    | 0.50 |  0.55  |   0.22   |  0.48   ||   0.61   | 0.63
US      |  0.44   |   0.37    | 0.13 |  0.23  |   0.17   |  0.20   ||   0.33   | 0.54


Table 7.2: Context Features Results 


Our features are comparable with other context-based ones in the OPU
scenario. In the US scenario, all the features except TagType and
TagDomain lose predictive power due to the limited context of a single
user. TagType and TagDomain retain theirs because they better model the
semantics of each photo, regardless of the user. When combined, our
features boost the correlation from 0.33 for the baseline to 0.54.


Visual + Context + User In this experiment we combined visual, context
and user features, and evaluated the total combination with and without
our novel features. User features are added to resemble a state-of-the-art
pipeline. Each modality is tested individually and finally all are combined.
Results are reported in Table 7.3. Note that User Features can't be used
for the User Specific scenario as each model is trained for a single user.


Dataset | Method   | Visual Content | Image Context | User Features || All
OPU     | proposed |      0.36      |     0.63      |     0.72      || 0.76
        | baseline |      0.30      |     0.61      |     0.72      || 0.74
US      | proposed |      0.43      |     0.54      |      n/a      || 0.61
        | baseline |      0.40      |     0.33      |      n/a      || 0.50


Table 7.3: Visual + Context + User Features Results 


User Features produce the highest correlation in the OPU scenario, con- 
firming that popularity is highly related to the popularity of the author 
(Khosla et al., 2014). Despite this, the combination of the three modalities 
is helpful, boosting correlation from 0.72 to 0.74. Our features further
improve upon this, bringing the value to 0.76. In the User Specific
dataset, the improvement over the baseline is more pronounced, with a
correlation of 0.61 versus 0.50.



7.4.2 Qualitative Analysis 


We investigate which specific ANPs and semantic metadata correlate the
most with the number of views of images. This analysis is performed for the
One-Per-User scenario, as it aims to be as generic as possible. Fig. 7.2(a)
shows the trained SVR weights for each of the 2089 ANPs, in descending
order. According to the figure we split the visual sentiments into three
categories.

A first group includes those ANPs that have a positive impact on image
popularity (e.g. sexy legs, beautiful eyes, heavy rain). The rapid drop
shows that a very small number of ANPs corresponds to strongly popular
images in the training dataset. Then, we observe that some visual
sentiments obtain very low weights, near zero: these ANPs are almost
irrelevant to the number of views (e.g. sunny trees, dry forest). Finally a
third group includes ANPs that are associated with a sufficiently negative
score: the detection of those pushes an image towards unpopularity
(e.g. creepy eyes, silly clown).

Extending our analysis to the 24 basic emotions of the Plutchik wheel,
we found that our model marked as unpopular those images that arouse
emotions such as annoyance or serenity, while high scores are likely to be
returned for sentiments such as amazement or ecstasy. These last emotions
derive from ANPs containing the adjective sexy, resulting in 10 occurrences
in the top 35 visual emotions. A similar analysis on the 100 semantic
entities is shown in Fig. 7.2(b). This plot has a similar trend to
that of visual sentiment, except for the extreme values: in this case the
negatively weighted types (e.g. religious practice and software genre) have
more prominent values than the positively weighted ones (e.g. garment and
film character).
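
The three-way reading of the weights can be sketched as follows. The ANP
names come from the examples in the text, but the numeric weights and the
threshold are invented for illustration and are not results of the
experiments; the function name is also ours.

```python
def group_by_weight(weights, threshold):
    """Sort concepts by weight and split them into three groups.

    `weights` maps a concept name to its learned linear weight;
    `threshold` (an assumed, hand-picked value) bounds the near-zero band.
    """
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    positive = [name for name, w in ranked if w > threshold]
    neutral = [name for name, w in ranked if abs(w) <= threshold]
    negative = [name for name, w in ranked if w < -threshold]
    return positive, neutral, negative


# Made-up weights, for illustration only.
toy_weights = {
    "sexy legs": 0.9,
    "heavy rain": 0.4,
    "sunny trees": 0.01,
    "dry forest": -0.02,
    "creepy eyes": -0.6,
}

positive, neutral, negative = group_by_weight(toy_weights, threshold=0.05)
```

With a linear model, this kind of ranking is what makes the learned
weights directly interpretable per concept.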


7.5 Conclusions 


In this chapter we proposed to employ state-of-the-art visual sentiment
features and three new context features to address the problem of predicting
whether an image posted on a social network may become popular. We
are the first to show a qualitative analysis of which sentiments (as ANPs)
are correlated with popularity. Our experiments suggest that some
sentiments are correlated with popularity, although less strongly than
user features. However, together with our novel context features, they
have good predictive power, especially when user features are unavailable
as in the User Specific scenario.

Figure 7.2: Influence of Multimedia Concepts on Popularity: weights of the 2089 ANP
visual sentiment concepts (top); weights of the 100 Freebase Types extracted from
contextual image tags (bottom).
Chapter 8
Conclusion 


This chapter summarizes the contribution of the thesis and dis- 
cusses avenues for future research. 


8.1 Summary of Contribution 


After presenting a structured survey of related work on social tagging and
retrieval, we detailed a novel experimental protocol that we used to test and
analyze eleven key methods. Having established the state of the art, we
proposed several models and methods to achieve objective annotation of
images. Finally we moved to subjective annotation of the sentiments
aroused in a viewer and the expected popularity of an image.

In particular, we first presented in Chapter 2 a survey on image tag
assignment, refinement and retrieval, with the hope of illustrating the
connections and differences between the many methods and their
applicability, and consequently helping the interested audience to either
pick up an existing method or devise a method of their own given the data
at hand. Based on the key observation that all works rely on tag relevance
learning as the common ingredient, existing works, which vary in terms of
their methodologies and target tasks, are interpreted in a unified
framework. Consequently, a two-dimensional taxonomy has been developed,
allowing us to structure the growing literature in light of what
information a specific method exploits and how the information is
leveraged in order to produce its tag relevance scores.

Having established the common ground between methods, a new experimental
protocol was introduced in Chapter 3 for a head-to-head comparison
of state-of-the-art methods. A selected set of eleven representative works
was implemented and evaluated for tag assignment, refinement, and/or
retrieval.

Nearest neighbor methods proved to be the best overall performing
methods for assignment in Chapter 3. Hence, we proposed a novel technique
in Chapter 4 that reduces the semantic gap in that class of methods. We
presented a cross-media model based on KCCA for tag assignment. The key
idea was learning a semantic space, where visual and textual data were
represented as blended unified features. This representation is able to
provide better neighbors for nearest neighbor algorithms. The experimental
results showed that our method makes consistent improvements over
standard approaches based on a single-view visual representation as well
as other previous work that also exploited tags. The properties of the
tested methods found in Chapter 3 remain valid in the semantic space,
although with an improved capability of retrieving better neighbors. Hence
a better performance is obtained.


Tiberio Uricchio, Image Understanding by Socializing the Semantic Gap, ISBN 978-88-6453-576-0 (print),
ISBN 978-88-6453-577-7 (online) © 2017 Firenze University Press

Considering the influence of real world events in tagging behavior, in 
Chapter 5 we briefly analyzed the correlations between user tags, news and 
the objective relevance of concepts. The results suggest that analyzing the 
time series of tags may be beneficial to annotate social media. 

Moving on to subjective information extraction, in Chapters 6 and 7 we
explored the related tasks of sentiment analysis in tweets and the popularity
estimation of images in social networks. In Chapter 6 we presented
a method for sentiment analysis of social network multimedia, capable of
learning both textual and visual features in a unified fashion. Our model
CBOW-LR, extending the CBOW model, concurrently learns a vector
representation and a sentiment polarity classifier on short texts. Compared
to previous work, our representation explicitly includes the sentiment of
words and maintains good performance. By adding images to the mix,
a further extension, CBOW-DA-LR, was presented. This semi-supervised
model concurrently learns text and image representations, as well as the
sentiment polarity classifier for tweets containing images. Experiments
with a large unsupervised corpus of tweets show promising results compared
to the state of the art.

Chapter 7 presented a novel approach to predict whether an image
posted on a social network may become popular. The approach uses a
combination of state-of-the-art visual sentiment features and three novel
context features to reduce the semantic gap. The experiments reported
suggest that some sentiments are correlated with popularity. Moreover,
our novel context features have good predictive power, especially when
user features are unavailable. We also presented the first study that shows
a qualitative analysis of which sentiments (as ANPs) are correlated with
popularity.


8.2 Direction of future work 


Much remains to be done. Several exciting recent developments open up
new opportunities for the future.

First, extraction of objective information can profit from recent
developments in deep learning. Employing novel deep learning based visual
features is likely to boost the performance of annotation methods that
employ visual features. What is scientifically more interesting is to
devise a learning strategy that is capable of jointly exploiting tag,
image, and user information in a much more scalable manner than
currently feasible. The importance of the filter component, which refines
socially tagged training examples in advance of learning, is underestimated.
Having a number of collaboratively labeled resources publicly available,
research on joint exploration of social data and these resources is
important. This connects to the most fundamental aspect of content-based
image retrieval in the context of sharing and tagging within social media
platforms: to what extent a social tag can be trusted remains open.

Image retrieval by multi-tag query is another important yet largely
unexplored problem. For a query of two tags, it is suggested to view the
two tags as a single bi-gram tag (Li et al., 2012; Nie et al., 2012; Borth
et al., 2013), which is found to be superior to late fusion of individual
tag scores. Nonetheless, due to the increasing sparseness of n-grams, how
to effectively answer generic queries of more than two tags is challenging.

Exploiting further modalities remains a largely unexplored area of
research. In Chapter 5 we investigated the correlation of tags with the
ground truth and events gathered from news by considering the time
dimension. Although of limited scope, the study found that objective tags
have a strong correlation to both content and context, giving a promising
direction for improving content understanding. Possible extensions of this
work include the exploration of how richer textual and semantic cues from
natural language annotations might improve our models.

Compared to extracting objective information, subjective information
extraction is still young and full of exciting directions. We are still far
from getting reliable estimations of sentiments in visual content. Current
features are handcrafted on psychological or empirical studies but they are
inherently affected by the semantic gap. Automatically learning features,
similarly to the approaches used in deep learning, could bring considerable
improvements in recognizing feelings despite the hard interpretability of
filters. We barely scratched the surface in Chapter 6. Similarly, the
prediction of popularity still relies on basic handcrafted features.
Although the social network aspects are well known to be related to
popularity, visual content and context analysis is still needed when aiming
to maximize the popularity of a content. An underestimated factor is the
peculiarity of different cultures in having different values and thus
interests and feelings. Social networks can provide a worldwide playground
for studying these aspects.


We see contributions of this field as essential to other related fields such
as computer vision and artificial intelligence. The last two years were
marked by a surge of deep convolutional models that showed remarkable
improvement on vision tasks such as object recognition and image captioning.
However, their limit is related to the strong supervision they need for
training. Due to the cost of scaling these approaches, we expect an increased
interest in unsupervised and semi-supervised learning, ultimately reaching
social networks as an essential source of media.


“One way to resolve the semantic gap comes from sources outside the
image ...”, Smeulders et al. wrote at the end of their seminal paper
(Smeulders et al., 2000). While what such sources would be was mostly
unknown at that time, it is now becoming evident that the many images
shared and tagged in social media platforms are a promising way to resolve
the semantic gap. By adding new relevant tags, refining the existing ones
or directly addressing retrieval, access to the semantics of the content
has been much improved. This is achieved only when appropriate care is
taken to attack the unreliability of social tagging.
This research activity has led to several publications in international
journals and conferences. These are summarized below.